Abstract

The goal of distributed simulation is to speed up simulation by distributing
a simulation model’s execution over multiple processors. This thesis reviews existing
methods for distributed simulation and introduces an algorithm for distributed
discrete-event simulation, the distributed event list algorithm. The algorithm is
based on the Chandy-Misra algorithm and maintains an event list, similar to that
used in sequential simulation, at each logical process. Null messages are used for
deadlock avoidance. The algorithm is described and is shown to require a bounded
amount of memory at each logical process.

A performance analysis of the distributed event list algorithm is performed. In
the analytical portion, a linear event list implementation is shown to be of
super-linear time complexity in relation to the number of events simulated. This
time complexity implies a theoretical speed-up of greater than N for a simulation
distributed over N processors. This result contradicts the commonly held view that
attainable speed-up is bounded by N.

Empirical studies evaluate the performance of the distributed event list
algorithm under a variety of conditions. Speed-ups greater than N are shown to be
achievable for certain topologies of simulation models, confirming the time
complexity analysis. The topology of the simulation model is shown to greatly affect
the attained speed-up. Simulation networks with directed cycles exhibit extremely
poor performance, in agreement with previous performance studies of the
Chandy-Misra algorithm.

Alternate strategies for sending the Null messages used for deadlock avoidance
are compared. Results show that for tandem and feed-forward topologies, a certain
level of Null messages is beneficial to speed-up.

The problem of assigning a given simulation model to a set of logical processes
is addressed. It is seen that the topology of the logical system plays a critical
role in the effectiveness of an assignment strategy.


DISTRIBUTED  DISCRETE-EVENT  SIMULATION 
USING  VARIANTS  OF  THE  CHANDY-MISRA  ALGORITHM 
ON  THE  INTEL  HYPERCUBE 


I.


Introduction 


Distributed discrete-event simulation has been a topic of intense research
interest since the late 1970s. The principal goal of distributed simulation is to
reduce the time needed to perform a simulation by spreading its execution over
multiple processors. This is accomplished by exploiting the parallelism inherent to
discrete-event simulation.


The time required to execute large simulation models on even the fastest
sequential computers has limited the use of simulation in several application
domains, such as large-scale digital logic systems, simulation of weather patterns,
and military applications such as strategic defense and conventional battle
management. Distributed simulation is one possible way of extending the usefulness
of simulation into these and other areas.


While algorithms for distributed discrete-event simulation have existed for a
number of years, little had been accomplished until very recently in the task of
evaluating and refining these algorithms, perhaps due to the non-availability of
cost-effective distributed computing systems. Early performance studies of
distributed simulation algorithms were accomplished by using uniprocessor computers
to simulate distributed computing systems [CM79].

The  widespread  availability  of  relatively  cheap  microprocessors  has  enabled 
the  development  of  less  costly  multiprocessor  computing  systems  for  the  research 
community  and  for  general  use,  resulting  in  new  opportunities  and  incentives  for  the 
development  of  distributed  simulation  methods  [Hei86].  Recent  research  has  focused 
on  performance  evaluation  of  the  existing  algorithms,  based  on  empirical  studies 
performed  on  various  distributed  architectures  [Fuj88,  RM88,  RMM88]. 

1.1  Simulation  Overview

Simulation is a mathematical tool that allows the user to modify and experiment
with a system when it would otherwise be impractical to do so. Simulation has
been defined by Banks and Carson as “the imitation of the operation of a real-world
process or system over time” [BC84] and by Shannon as “the process of designing a
model of a real system and conducting experiments with this model for the purpose
either of understanding the behavior of the system or of evaluating various strategies
for the operation of the system” [Sha75].

In defining a simulation model, the modeler seeks to capture the essential
behavior of some physical system to a certain level of abstraction. The modeler
implements the model by constructing a simulation computer program. In doing so, the
model must be defined in terms of functions that are computable, necessitating the
adoption of a particular paradigm or view of the physical systems to be simulated.
The modeler can implement the model in a general-purpose programming language
such as FORTRAN or Ada, or by using a specialized simulation language such as
GASP [Pri74], SLAM II [Pri86], and many others. Simulation languages save a large
amount of programming effort by providing pre-defined, standard functions for
controlling a simulation, calculating statistics, etc., allowing the modeler to
concentrate on the abstraction of the simulated system [LK82].

Discrete-event simulation refers to simulation of a system whose state changes
only at a countable number of points in time. The majority of discrete-event
simulation models are stochastic in nature, having one or more random variables as
inputs. The outputs of such models are also random variables, and statistical
methods must be used in their analysis [Sha75].

Most  discrete-event  simulation  is  characterized  as  being  event-driven,  where  an 
event  is  an  instantaneous  occurrence  which  may  change  the  state  of  a  system  [BC84]. 
The  event-driven  method  of  discrete-event  simulation  advances  simulated  time  in 
irregular  intervals  defined  by  the  time  of  occurrence  of  each  simulated  event.  An 
alternate  discrete-event  simulation  method,  time-driven  discrete-event  simulation, 
advances  simulated  time  in  regular  intervals  and  simulates  each  event  when  the  time 
of  its  occurrence  has  been  reached  [LK82]. 

In a typical sequential implementation of the event-driven approach to
simulation, an “event list” is maintained of events that have been scheduled to occur.
The simulation advances by simulating the imminent event in the event list; i.e., the
event with the earliest simulation time associated with it. This event is removed
from the event list and the simulation clock is advanced to the time of the event.
Simulating an event may change the values of variables that describe the state of
the system, and, in addition, may cause new events to be added to the event list
[BC84]. Discrete-event simulations are generally designed to terminate when the
simulation clock reaches a certain value, or when the number of occurrences of some
event reaches a pre-defined limit [LK82].
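The event-driven loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `run_sequential`, the handler signature, and the toy arrival/departure model are assumptions made for the example, and a binary heap stands in for the event list for brevity (the thesis itself analyzes a linear event list).

```python
import heapq
import itertools


def run_sequential(initial_events, handlers, end_time):
    """Run a minimal event-driven simulation and return the final state.

    initial_events: iterable of (time, kind) pairs.
    handlers: maps an event kind to a function (state, time) that returns
              a list of newly scheduled (time, kind) pairs.
    """
    counter = itertools.count()              # tie-breaker for equal timestamps
    event_list = []                          # heap ordered by event time
    for time, kind in initial_events:
        heapq.heappush(event_list, (time, next(counter), kind))

    state = {"clock": 0.0, "processed": []}
    while event_list:
        time, _, kind = heapq.heappop(event_list)   # remove the imminent event
        if time > end_time:                         # terminate on a clock value
            break
        state["clock"] = time                       # advance the simulation clock
        state["processed"].append((time, kind))
        for new_time, new_kind in handlers[kind](state, time):
            heapq.heappush(event_list, (new_time, next(counter), new_kind))
    return state


def arrival(state, time):
    # Hypothetical model: each arrival schedules a departure and the next arrival.
    return [(time + 1.5, "departure"), (time + 2.0, "arrival")]


def departure(state, time):
    return []


final = run_sequential(
    [(0.0, "arrival")],
    {"arrival": arrival, "departure": departure},
    end_time=5.0,
)
```

Termination on an event count, the other stopping rule mentioned above, would replace the `end_time` test with a counter check inside the loop.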

1.2  Distributed  Processing  Overview 

There is no consensus concerning the exact definition of a distributed
processing system. Is the “distribution” necessarily geographic in nature, or does it
simply refer to the logical distribution of processing? J.P. Verjus prefers a strict
definition based on geographic distribution of processing resources, characterizing a
distributed system as “a set of separate sites ... interconnected by communications
channels” [PV83]. While some definitions classify distributed processing as a form of
parallel processing, Hwang and Briggs maintain a distinction. They assert that
“parallel processing and distributed processing are closely related,” although as data
communications technology advances, “the distinction between parallel and
distributed processing becomes smaller and smaller” [HB84].

One distinctive feature of distributed systems seems to be that no shared
memory is allowed [Sch82]. A distributed system has been defined by D. Herman as “a
set of co-operating processes installed on a multiprocessor architecture without a
common memory [PV83].” The preceding definition of a distributed system shall be
used for the remainder of this paper, since it appears to describe the essential
features of a distributed system without the arbitrary constraint of geographic
separation.

A distributed program, then, is a collection of cooperating (or communicating)
processes. Because one of the goals of a distributed system is to increase the
throughput over that available in a non-distributed system, it is desirable to
exploit the maximum amount of concurrency. This is accomplished by allowing the
processes to run asynchronously, communicating only as required [Sch82]. Hence,
synchronization of communicating processes is a major part of distributed computing
algorithms.

C.A.R. Hoare has introduced a notation and paradigm for the specification and
operation of communicating sequential processes [Hoa78]. His paradigm is designed
to allow proofs of correctness, to limit memory requirements, and to eliminate
nondeterminism to the maximum extent possible [Hoa85]. Many others, most notably
Dijkstra, have developed specific algorithms relating to the correct operation of
communicating processes [Lam78, DS80, Sch82].

In partitioning a task into a system of communicating sequential processes,
several design criteria are relevant. Defining the processes in such a way as to
balance the processing load will maximize throughput, assuming homogeneous
processors and excluding interprocessor communications. On the other hand,
interprocess communication carries with it an overhead cost. While we would expect
linearly-increasing throughput with additional processors, communication overhead
eventually leads to the saturation effect. This effect, similar to “thrashing” in
early memory paging systems, can cause throughput to decrease with each additional
processor applied to a problem [CH+80].

The throughput advantage of a parallel computation is often quantified by
the  Speed-up  factor,  which  is  the  ratio  of  the  time  taken  to  execute  a  computation 
on  a  single  processor  to  the  time  taken  to  execute  the  same  computation  in  the 
parallel  system  [Fuj88].  The  efficiency  of  a  distributed  implementation  is  described 
by  the  ratio  of  Speed-up  to  N,  where  N  is  the  number  of  concurrent  processes  in  the 
distributed  system.  If  the  same  algorithm  is  used  in  both  the  single  and  multiple 
processor  computations,  perfect  efficiency  is  considered  to  be  1.0,  corresponding  to 
a  Speed-up  factor  of  N  [Hei86]. 
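The two ratios defined above can be written compactly. The symbols T_1, T_N, S and E are notation assumed here for illustration; they do not appear in the original text.

```latex
S(N) = \frac{T_1}{T_N}, \qquad E(N) = \frac{S(N)}{N}
```

As a hypothetical example, a computation that takes T_1 = 120 seconds on one processor and T_N = 20 seconds on N = 8 processors has S = 6 and E = 0.75. Perfect efficiency, E = 1.0, corresponds to S = N.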

1.3  Problem  Statement 

In a simulation of sufficient size and complexity to merit the use of distributed
computing techniques, it is likely that the system to be modelled will consist of a
larger number of components, or physical processes, than there are processors
available in the distributed computing system. Even when this is not the case, it is
preferable in some instances to distribute a simulation over a subset of available
processors, as in Heidelberger’s method of concurrently performing simulation
replications, achieving speed-up equal to the number of processors [Hei86].


In cases such as these, it is desirable to have an efficient algorithm for
simulating multiple physical processes on a single processing node. A natural
candidate algorithm is the future events list method used in sequential
discrete-event simulation. The concept of combining a conservative distributed
simulation algorithm with an event list at each logical process is not a new one.
Bryant’s algorithm for distributed simulation included such a fusion of distributed
and sequential simulation concepts [Bry79]. Authors of subsequently published
algorithms, however, generally have chosen either to extend the message-passing
paradigm within each logical process, or not to address the issue.


In addition, many algorithms for distributed discrete-event simulation require
the assumption of a particular “world-view” or paradigm of the physical system
being simulated. However, we may not wish to modify our abstraction of the physical
system in order to accommodate the implementation of our simulation model on a
particular computing system. A desirable method for distributed simulation will
allow distributed simulation with a physical system paradigm based on the widely-used
event-oriented view, easing the problems associated with parallelizing a simulation
model that has been developed for sequential simulation.

It is the goal of this thesis to present such a method for distributed
discrete-event simulation, the distributed event list algorithm, and to describe the
conditions under which the algorithm can be expected to efficiently provide
significant speed-up of a discrete-event simulation.

An  event  list  structure  at  each  logical  process  ensures  chronological  execution  of 
events.  It  will  be  shown  that  under  certain  conditions  with  a  simulation  distributed 
over  N  processors,  the  distributed  event  list  algorithm  can  yield  speed-up  greater 
than  N.  Speed-up  equal  to  the  number  of  processors  has  often  been  asserted  in 
the  literature  to  be  the  maximum  achievable  speed-up  for  a  distributed  simulation 
[Hei86,  RMM88]. 
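A rough intuition for why a speed-up above N is conceivable with a linear event list can be sketched as follows. This is an idealized illustration under assumptions introduced here, not the analysis presented later in the thesis: it assumes that inserting into a linear list costs time proportional to the number n of pending events, that E events are simulated in total, that events and pending entries divide evenly among the N logical processes, and that communication and blocking costs are ignored. The constant c and the symbols n and E are assumed notation.

```latex
T_{\mathrm{seq}} \approx c \, n \, E, \qquad
T_{\mathrm{dist}} \approx c \, \frac{n}{N} \cdot \frac{E}{N}, \qquad
\frac{T_{\mathrm{seq}}}{T_{\mathrm{dist}}} \approx N^{2}
```

Under these assumptions each logical process handles fewer events and also searches a shorter list, so the two savings multiply. Real simulations incur synchronization overhead that this sketch omits.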

The distributed event list algorithm uses a method of interprocess
communication and synchronization based on the Null message algorithm for distributed
simulation proposed by Chandy and Misra [CM79]. Several variants on the basic
communications method are presented. The effects of the topology of physical
process interconnection and the computational intensity of event processing on the
resultant speed-up are explored.

1.4  Scope 

This thesis includes a review of methods for performing distributed
discrete-event simulation in Chapter 2, and develops a particular method, the
distributed event list algorithm with several variants, based on the conservative
synchronization protocol proposed by Chandy and Misra, in Chapter 3. An analytical
performance analysis of this algorithm is presented and supported with empirical
studies in Chapter 4. Chapter 5 summarizes the results of the analysis and studies
to provide insight into the conditions under which the algorithm can be expected to
provide significant speed-up, including comparisons among the variants.

II.


Literature  Review 


2.1  Parallelism  in  the  Simulation  Process


Kaudel  identifies  three  kinds  of  parallelism  in  the  simulation  process,  each  of 
which  can,  at  least  theoretically,  be  exploited  to  speed  up  simulation.  Application 
level  parallelism  consists  of  executing  multiple  simulation  trials  concurrently.  Support 
function  distribution  utilizes  parallel  processing  in  the  computations  required  by 
simulation  overhead  functions,  while  retaining  the  overall  sequential  nature  of  the 
simulation.  Model  function  distribution  consists  of  the  spatial  decomposition  and 
parallelization  of  a  single  simulation  model.  Of  the  three,  model  function  distribution 
has  by  far  received  the  most  attention  in  distributed  simulation  research.  Kaudel 
asserts that these three approaches could be applied simultaneously to a
discrete-event simulation [Kau87].


2.1.1  Application  Level  Parallelism

Application level parallelism takes advantage of the multiple trials that are
generally performed in simulation experiments. Execution of a stochastic simulation
model is normally replicated in order to obtain reasonable confidence interval
estimates of output parameters [LK82]. In addition, simulation experiments are
usually conducted for the purpose of comparing two or more alternatives for a
system’s operation. A model is constructed for each alternative and run (with each
model execution replicated as above) [Sha75]. Application level parallelism achieves
speed-up by concurrently executing the individual trials on separate processors.
Running N replications concurrently, the theoretical speed-up factor is N, achieving
what is considered “perfect” efficiency, excluding the negligible overhead of the
control functions [Hei86].
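Concurrent execution of independent replications can be sketched as below. The function `one_replication` is a hypothetical stand-in for a full simulation run, and the seeds and worker count are arbitrary. A thread pool is used only to keep the sketch short and portable; in CPython, CPU-bound replications would need separate processes or machines to realize any actual speed-up.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def one_replication(seed):
    """Hypothetical stand-in for one simulation trial; returns one output value."""
    rng = random.Random(seed)                # independent random stream per trial
    return statistics.fmean(rng.expovariate(1.0) for _ in range(1000))


def run_replications(seeds, workers):
    # The trials do not interact, so they can be executed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_replication, seeds))


results = run_replications(seeds=range(8), workers=4)
mean = statistics.fmean(results)
# Rough normal-approximation half-width; a t quantile would be more
# appropriate for so few replications.
half_width = 1.96 * statistics.stdev(results) / len(results) ** 0.5
```

Because each trial owns its random stream, the results do not depend on how the trials are scheduled across workers.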

Biles et al. discuss distributing simulation at the application level using a
hierarchical network of microcomputers, combining application level parallelism and
support function distribution [BDO85]. In Biles’ method, a “tree” configured
network of microcomputers is used, with the individual simulation trials performed at
the lower levels of the tree and statistical analysis, optimization, and control
functions performed at successively higher levels.

Heidelberger provides a theoretical basis for computing the relative efficiency
of distributing each simulation trial over multiple processors versus executing
parallel independent replications, given a stochastic simulation [Hei86].
Heidelberger’s analysis shows that certain statistical considerations, such as
initialization bias, are important in determining the optimal combination of model
function distribution and application parallelism for achieving the desired accuracy
in the minimum time.

The uncomplicated approach of concurrently executing independent simulation
trials has received scant attention in distributed simulation literature, possibly
because it is so straightforward, and a product of statistical considerations rather
than computer science. Kaudel comments as late as 1987 that the use of application
level parallelism is untested [Kau87]. Application level parallelism appears to
offer a significant source of concurrency for simulation, although its utility
naturally depends on the specific application involved.

2.1.2  Support  Function  Distribution

Support function distribution exploits the parallelism available by functionally
distributing a simulation. Model-independent support functions, such as
accumulating statistics, managing the event set, generating pseudorandom numbers,
etc., often comprise a large portion of the required computation in a simulation, up
to 80% in one simulation analyzed by J. Comfort [Com82]. Comfort achieved a maximum
speed-up factor of 1.4 for one benchmark using a PDP-11 host with three Motorola
MC68000 microprocessor systems serving as a pipelined event set processor. The
addition of more processors to the pipeline had negligible effect [Com82]. In
additional studies, Comfort achieved speed-ups from 1.5 to 1.8 on a system consisting
of a PDP-11 host processor and three MC68000 systems added to provide priority queue
manipulation, state accounting, and event set processing [Com83]. A method of
pipelined event list processing for a shared memory multiprocessor has been proposed
by D. Jones, but has not been implemented [Jon86a].

Support function distribution is certainly limited by the inherent lack of
parallelism in traditional simulation algorithms. Comfort’s results indicate that
nothing resembling an N-fold speed-up can be expected through support function
distribution [Com83]. Jones comments that, hypothetically, support function
distribution could be used in conjunction with other forms of parallelism to increase
speed-up [Jon86a]. Kaudel notes that the modest speed-ups achieved with support
function distribution might prove it to be a less efficient alternative to the other
distributed simulation methods [Kau87].

2.1.3  Model  Function  Distribution

The vast majority of distributed simulation literature deals with methods of
distributing the event routines of a simulation model over multiple processors.
Indeed, most authors do not address any other possibilities for distributed
simulation. Model function distribution exploits the parallelism of the simulated
system to determine which events in the system can be simulated concurrently. Kaudel
notes that distribution of model functions generally implies the homogeneous
distribution of simulation support functions as well [Kau87]. For the remainder of
this paper, distributed simulation is synonymous with the model function distribution
of simulation.

The  traditional  sequential  approach  to  simulation  does  not  lend  itself  well  to 
parallelization  because  of  its  frequent  manipulation  of  a  single  data  structure,  the 
future  events  list  [CM81,  RMM88].  The  future  events  list  (or  simply  event  list) 
provides  an  ordering  of  all  events  in  a  simulation,  enforcing  a  totally  sequential 
view  of  the  behavior  of  a  physical  system  that  typically  exhibits  some  degree  of 
concurrency. 

To achieve the maximum parallelization of a simulation, the ordering of events
is restricted to that dictated by event dependencies within the simulated system. If
an event B depends on event A, then event A must be simulated before event B.
However, if the two events are independent, they may be simulated in any order, or
concurrently. The dependency relation formalizes our intuitive understanding of the
order in which events have to occur in a simulated system. Over an entire simulation,
the dependency relationship forms an irreflexive partial order of the simulation’s
events [Mis86]. This was recognized by Lamport, who devised a synchronization
protocol that preserves event dependencies within systems of distributed processes
[Lam78].

Algorithms for distributing a simulation model over multiple processors
typically partition the simulation among several communicating processes, in which
each process models a portion of the overall system and communicates with other
processes by passing messages. These processes are most often referred to as logical
processes [CM79] or objects [BJ85]. It is interesting to note that an object-oriented
view of sequential simulation using message-passing has previously been implemented,
due to software engineering considerations, by several simulation languages, most
notably Simula-67 [DN66].

2.2  Distributed  Simulation  Algorithms

Distributed simulation algorithms differ as to their methods of interprocess
synchronization and degree of centralized control. Most distributed simulation
algorithms use some method of blocking and unblocking the execution of the logical
processes for inter-process synchronization [CM79, Bry79, Rey83], although an
alternate method, to be discussed later, has been proposed [Jef85]. When process
blocking is used to enforce event ordering, there is the potential problem of
deadlock occurring among logical processes, unless provisions have been built into
the algorithm to preclude deadlock.

2.2.1  Chandy-Misra  Null  Message  Algorithm

K. Chandy and J. Misra proposed one of the original algorithms for distributed
discrete-event simulation in [CM79]. In their Null Message algorithm, processes
communicate exclusively through messages to other processes; there is no shared
memory or central controlling process. The individual processes run asynchronously,
each executing the same algorithm to ensure inter-process synchronization. The
algorithm uses “Null” messages to avoid process deadlock, a problem inherent to this
algorithm.

In describing their Null Message algorithm, Chandy and Misra define a model
of physical systems based on the concept of communicating Physical Processes (or
PP’s) [CM79]. A simulated system consists of a finite number of physical processes,
which represent components of the system and communicate exclusively through
messages, each message associated with a point in time.

Messages  in  the  physical  system  are  of  the  form  (t,m),  where  t  is  the  time  of 
message  transmission  and  receipt,  and  m  is  the  message  content.  Messages  are  sent 
between  any  two  PP’s  in  order  of  increasing  t  value. 
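The (t,m) message form and the ordering rule on each link can be sketched as follows. The class names `Message` and `Channel` and the attribute `last_t` are hypothetical helpers introduced for this example, and the sketch assumes that equal timestamps on consecutive messages are acceptable.

```python
from collections import deque
from typing import Any, NamedTuple


class Message(NamedTuple):
    t: float   # time of message transmission and receipt
    m: Any     # message content


class Channel:
    """One-directional link between two processes (hypothetical helper)."""

    def __init__(self):
        self.queue = deque()
        self.last_t = 0.0   # t value of the most recent message sent

    def send(self, t, m):
        # Messages between two processes must be sent in order of increasing t.
        if t < self.last_t:
            raise ValueError("messages on a channel must not go backwards in time")
        self.last_t = t
        self.queue.append(Message(t, m))


ch = Channel()
ch.send(1.0, "job A")
ch.send(2.5, "job B")
```

Attempting `ch.send(2.0, "job C")` after the calls above raises `ValueError`, because it would violate the timestamp ordering on the channel.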

Chandy and Misra discuss two important properties of physical systems:
realizability and predictability. All physical systems have the property of
realizability, which states that “A message sent by a PP at time t is a function of
its initial state, t, and the messages it has received up to and including t.” In
addition, physical systems have the property of predictability. Predictability
ensures that “the output of any PP up to any time t can be computed given the initial
state of the system” [Mis86].

A physical system of N PP’s can be simulated by constructing a simulator
consisting of N asynchronous Logical Processes (LP’s), in which LPi simulates PPi.
In the logical system, there is a communications channel from LPi to LPj if and only
if PPi sends messages to PPj in the physical system. Messages are assumed to be
transmitted correctly using an unspecified communications protocol [CM79].

Process  synchronization  is  accomplished  by  permitting  a  process  to  advance 
its  simulation  clock  only  when  it  is  certain  that  the  process  will  receive  no  messages 
with  a  timestamp  value  less  than  the  new  time  on  the  process  simulation  clock.  The 
t  value  of  the  last  message  transmitted  over  a  channel  is  its  channel  clock  value.  An 
LP  may  advance  its  simulation  clock  to  the  minimum  of  all  its  incoming  channel 
clocks.  This  ensures  that  no  message  will  arrive  to  cause  an  incorrect  sequence  of 
events  at  any  LP. 
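The clock-advance rule amounts to taking a minimum over the incoming channel clocks. A minimal sketch, in which the function name `safe_clock` and the channel labels are assumptions for illustration:

```python
def safe_clock(incoming_channel_clocks):
    """Latest time to which an LP may advance: the minimum incoming channel clock."""
    return min(incoming_channel_clocks.values())


channel_clocks = {"from_LP1": 7.0, "from_LP2": 4.5, "from_LP3": 9.0}
lp_clock = safe_clock(channel_clocks)

# A later message on the slowest channel raises the bound.
channel_clocks["from_LP2"] = 8.0
lp_clock_after = safe_clock(channel_clocks)
```

In this example the LP may first advance only to 4.5, and then to 7.0 once the channel from LP2 has reached 8.0.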

The basic algorithm as described is subject to the problem of process deadlock
[CM79, CM81]. In a cyclic network, a cycle of LP’s with the same simulation time
may occur, so that in effect each LP waits for a message that only it can provide.
This deadlock may be avoided by requiring at least one LP in each cycle to have some
positive delay time between receiving a message and sending a message [CM79].
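The condition that every cycle contain at least one LP with positive delay is equivalent to requiring that the LP's with zero delay, taken on their own, form no directed cycle. The check below is a sketch of that idea using a depth-first search; the function name, the `successors` adjacency mapping, and the `delay` mapping are all assumptions made for the example and are not part of the Chandy-Misra algorithm itself.

```python
def has_zero_delay_cycle(successors, delay):
    """Return True if some directed cycle consists only of LPs with zero delay.

    successors: maps every LP name to a list of the LPs it sends to.
    delay: maps every LP name to its receive-to-send delay.
    """
    zero = {lp for lp in successors if delay[lp] == 0}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {lp: WHITE for lp in zero}

    def visit(lp):
        colour[lp] = GREY
        for nxt in successors[lp]:
            if nxt not in zero:
                continue                  # a positive-delay LP breaks the cycle
            if colour[nxt] == GREY:
                return True               # back edge: zero-delay cycle found
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[lp] = BLACK
        return False

    return any(colour[lp] == WHITE and visit(lp) for lp in zero)


ring = {"A": ["B"], "B": ["C"], "C": ["A"]}
risky = has_zero_delay_cycle(ring, {"A": 0, "B": 0, "C": 0})
safe = has_zero_delay_cycle(ring, {"A": 0, "B": 0, "C": 1.0})
```

For the three-LP ring, the check reports a problem when all delays are zero and none once LP C is given a positive delay.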

Another type of deadlock [CM79] may occur even in acyclic networks using
this algorithm. In order to advance its simulation clock, an LP must wait for messages
on all incoming channels whose channel clock values are equal to the value of the LP
simulation clock. If no messages ever arrive on one or more of these channels, the
LP is blocked, and cannot progress in its simulation.

Null messages are the mechanism used to avoid this possibility of deadlock. A
Null message notifies an LP not to expect a “real” message over a channel up to a
given point in time. Its effect is to advance the channel clock of the channel on
which it is sent, allowing the recipient LP to advance its clock. The Null message
has no other effect on the state of the system, as the message content is “Null.”
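The effect of a Null message on a blocked LP can be illustrated with a small sketch. The class `LogicalProcess`, the use of `None` to represent Null content, and the policy of simulating every pending message whose timestamp does not exceed the minimum channel clock are simplifying assumptions for this example rather than details taken from [CM79].

```python
NULL = None   # assumed representation of Null message content


class LogicalProcess:
    def __init__(self, incoming):
        self.channel_clock = {name: 0.0 for name in incoming}
        self.pending = []      # real messages not yet simulated
        self.simulated = []    # real messages simulated so far, in time order
        self.clock = 0.0

    def receive(self, channel, t, m):
        self.channel_clock[channel] = t          # any message advances its channel clock
        if m is not NULL:
            self.pending.append((t, m))          # a Null message has no content to simulate
        horizon = min(self.channel_clock.values())
        self.pending.sort(key=lambda msg: msg[0])
        while self.pending and self.pending[0][0] <= horizon:
            self.simulated.append(self.pending.pop(0))
        self.clock = horizon


lp = LogicalProcess(["A", "B"])
lp.receive("A", 5.0, "job")        # channel B is still at 0.0, so the LP is blocked
clock_while_blocked = lp.clock
lp.receive("B", 6.0, NULL)         # the Null message advances channel B to 6.0
```

After the Null message arrives, the minimum channel clock is 5.0, so the LP advances its clock to 5.0 and simulates the waiting message from channel A.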

Chandy  and  Misra  offer  proof  of  correctness  of  their  Null  message  algorithm  by 
proving  “1)  chronology  of  the  (message)  tuple  sequence,  2)  correctness  of  every  tuple 
sequence  at  any  point  in  simulation,  3)  absence  of  deadlock,  and  4)  termination  of 
simulation”  [CM79]. 

2.2.2  Bryant’s  Infinite  Buffers  Algorithm

R. E. Bryant developed a conservative algorithm for distributed simulation,
concurrently with, but independent of, the early work of Chandy and Misra [Bry79].
Bryant’s algorithm is similar to the Null Message algorithm in many respects. There
are, however, significant differences in the communications protocols and time
advance mechanisms. The appellation “Infinite Buffers” [Rey83] was given to Bryant’s
algorithm because there is no way of knowing a priori the amount of memory that a
process in the simulation may require. This is due to the absence of flow control in
the inter-process communications protocol [Kau87].

Bryant’s  algorithm  is  based  on  a  paradigm  of  autonomous  processes,  which, 
as  in  the  Null  Message  algorithm,  communicate  through  timestamped  messages. 
Messages  between  two  processes  are  assumed  to  arrive  in  the  order  that  they  were 
sent,  but  not  necessarily  in  the  order  that  the  corresponding  messages  would  be 
sent in the physical system. Process interactions are represented by timestamped
“stimulus” messages sent between processes.

Bryant’s  algorithm,  unlike  the  Null  Message  algorithm,  queues  up  the  incoming 
stimulus  messages  in  a  future  events  list  within  each  logical  process.  This  feature 
potentially  allows  complex  physical  processes  to  be  encapsulated  within  a  single 
logical process. An event in Bryant’s algorithm consists of three steps:

1. a stimulus message is received (or dequeued);

2. a new state of the process is computed based on the old state and the nature
and simulation time of the stimulus; and

3. some number (possibly zero) of stimulus messages are sent to other processes
and possibly to the sending process itself [Bry79].

Stimulus  messages  are  the  method  by  which  the  simulation  changes  its  state, 
but  they  do  not  serve  as  a  time  advance  mechanism.  Each  process  in  the  simulation 
maintains  its  own  simulation  clock,  and  can  simulate,  in  time  order,  any  stimulus 
messages  with  timestamp  value  less  than  or  equal  to  the  value  of  the  process  clock. 
Bryant’s algorithm utilizes two mechanisms for advancing the process simulation
clocks: “time incrementation” and “time acceleration” [Bry79].

Time incrementation uses a special type of message called an increment message
that does not simulate a message sent in the physical system, but instead
conveys synchronization information, similar to the operation of a null message in
Chandy and Misra’s algorithm [CM79]. Because stimulus messages might not be
sent in chronological order, they cannot be used for synchronization. When a process
receives  an  increment  message  over  some  channel,  the  process  “knows”  it  will  receive 
no  earlier  stimulus  messages  over  that  channel,  and  can  increment  its  channel  clock 
to  the  time  of  the  increment  message.  The  logical  process  updates  its  own  clock 
in  a  manner  similar  to  that  used  in  the  Null  Message  algorithm,  simulates  pending 
events,  and  sends  its  own  increment  message  over  all  of  its  output  channels. 
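
The handling of an increment message described above can be sketched as follows. This is an illustrative Python sketch rather than Bryant’s own formulation: the class `BryantProcess`, its method names, and the choice to timestamp outgoing increment messages with the process clock plus a fixed `delay` are all assumptions.

```python
import heapq


class BryantProcess:
    """Toy process with a future events list and per-channel clocks (illustrative)."""

    def __init__(self, inputs, outputs, delay):
        self.channel_clock = {ch: 0.0 for ch in inputs}
        self.outputs = list(outputs)
        self.delay = delay        # assumed minimum stimulus-to-output delay
        self.clock = 0.0
        self.future_events = []   # heap of (timestamp, sequence number, stimulus)
        self._seq = 0             # tie-breaker so stimuli are never compared
        self.simulated = []

    def on_stimulus(self, timestamp, stimulus):
        # Stimuli may arrive out of timestamp order, so they are only queued.
        heapq.heappush(self.future_events, (timestamp, self._seq, stimulus))
        self._seq += 1

    def on_increment(self, channel, timestamp):
        # No earlier stimulus will arrive on this channel: advance its clock.
        self.channel_clock[channel] = max(self.channel_clock[channel], timestamp)
        self.clock = min(self.channel_clock.values())
        # Simulate, in time order, every queued stimulus not later than the clock.
        while self.future_events and self.future_events[0][0] <= self.clock:
            t, _, stimulus = heapq.heappop(self.future_events)
            self.simulated.append((t, stimulus))
        # Send an increment message of our own over every output channel.
        return [(out, self.clock + self.delay) for out in self.outputs]


p = BryantProcess(inputs=["x", "y"], outputs=["z"], delay=1.0)
p.on_stimulus(3.0, "late")
p.on_stimulus(2.0, "early")
first = p.on_increment("x", 5.0)   # channel y is still at 0.0: nothing can be simulated
second = p.on_increment("y", 2.5)  # clock becomes 2.5: only "early" is safe to simulate
```

The stimulus stamped 3.0 stays in the future events list until both channel clocks pass it.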

Thus, increment messages avoid acyclic deadlock in a manner similar to the
way null messages perform in the Chandy-Misra algorithm [CM79]. Cyclic deadlocks
are avoided by requiring that for every set of processes in a cycle, at least one has a
delay > 0, where delay is the minimum possible simulated time difference between
receipt of a stimulus and the resulting output stimulus.

Time acceleration is an additional synchronization mechanism that expedites
time advance within cyclic portions of the simulation. Peacock et al. noted that
significant message overhead and inefficiency may result when cycles of processes
in a distributed simulation network are left to iteratively increment their simulation
clocks by the minimum values, characterized as the “pseudo-time-driven effect”
[PWM79a].

Time  acceleration  requires  an  analysis  of  the  system’s  interconnections  a  priori, 
in  order  to  identify  the  cyclic  portions.  The  simulation  is  partitioned  into  a  set 
of  equivalence  classes,  where  the  members  of  each  class  form  a  cycle  within  the 
simulation  network,  or  the  class  consists  of  an  individual  process  which  belongs  to 
no  cycles.  For  each  cycle  in  the  simulation  network,  time  acceleration  identifies  the 
earliest  possible  time  of  the  next  event  in  the  cycle,  and  advances  the  clocks  of  all 
processes  in  the  cycle  to  that  time. 

Time  acceleration  is  implemented  by  arbitrarily  selecting  a  process  in  each 
cyclic  class  to  send  out  periodic  test  messages  to  all  processes  in  its  class  to  which  it 
has  a  channel.  The  test  messages  circulate  through  the  communications  channels  of 
the  class.  Each  process  that  the  test  message  passes  through  updates  it,  if  necessary, 
so  that  the  test  message  always  contains  the  time  of  the  earliest  potential  event  of 
any  of  the  processes  in  its  class  that  it  has  encountered.  When  all  test  messages  have 
returned to the sending process, the sending process takes the minimum of the times
contained  in  the  messages.  The  test  process  then  sends  a  wave  of  “set”  messages  to 
the  processes  in  its  class,  ordering  them  to  advance  their  clocks  to  the  new  time. 
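
A minimal Python sketch of one time acceleration round is given here for illustration. It assumes a single test message that visits every process of a simple ring once; the function name `accelerate` and the dictionary keys `clock` and `next_event` are hypothetical.

```python
def accelerate(cycle):
    """Run one test/set round over a list of process records.

    Each record is a dict with 'clock' and 'next_event', the latter being the
    earliest potential event time known to that process (assumed to be given).
    """
    # Test message: visits each process, keeping the smallest time seen so far.
    earliest = float("inf")
    for proc in cycle:
        earliest = min(earliest, proc["next_event"])
    # Set messages: every process moves forward (never backward) to that time.
    for proc in cycle:
        proc["clock"] = max(proc["clock"], earliest)
    return earliest


ring = [
    {"clock": 10.0, "next_event": 42.0},
    {"clock": 11.0, "next_event": 37.5},
    {"clock": 10.5, "next_event": float("inf")},  # nothing pending here
]
new_time = accelerate(ring)
```

One round replaces the many small clock increments that the cycle would otherwise need to reach the same simulation time.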

Certain  processes  in  a  simulation  are  identified  as  source  processes,  which  send 
stimulus messages but do not receive any. Termination of a simulation is accomplished
by causing each source process to send an increment message with timestamp
of  infinity  once  it  has  sent  all  of  its  stimulus  messages.  When  a  process  has  received 
such  a  message  over  each  of  its  incoming  channels,  it  will  receive  no  more  stimulus 
messages.  The  process  simulates  all  pending  events,  sends  out  infinity  increment 
messages  of  its  own,  and  finally  terminates. 

2.2.3 Peacock et al. - Link Time/Blocking Table Algorithms  J.K. Peacock, J.
Wong, and E. Manning, after collaboration with Chandy and Misra, published a
version of the Null Message algorithm called the Link Time algorithm [PWM79b].
Peacock et al. do not address required memory in their algorithm; however, there is
no mention of processes blocking due to send operations [PWM79b]. The Link Time
algorithm can then be considered almost equivalent to the Null Message algorithm,
without the flow control provisions of the latter.

Peacock et al. outline several other conservative algorithms in [PWM79b], including
their Blocking Table algorithm. In this algorithm, every process maintains a
list of other processes that can reach it by a path of empty links, and could therefore
cause its next event to be preempted. No detailed version of the algorithm
was provided. However, updating the blocking tables for each node appears to be a
complicated and arduous task.

A method for “tight” event-driven simulation, in which a global simulation
time is enforced and the next event over all processes is found and simulated, as well
as several methods for time-driven discrete-event simulation, were also addressed by
Peacock et al. [PWM79b].

2.2.4 Chandy-Misra Deadlock Detection and Recovery Algorithm  In [CM81],
Chandy and Misra published a distributed simulation algorithm based on deadlock
detection and recovery, developed as an alternative to their Null Message algorithm,
which had been shown to incur a large message overhead in simulations with feedback.
The simulation detects deadlock in a distributed manner [MC82], and a central
process, the controller, issues instructions to the appropriate processes to initiate
deadlock recovery. The new algorithm was constructed using the same simulation
paradigm of message-passing physical processes as used in the Null Message algorithm
[CM79, Mis86], described above.

Chandy and Misra characterize simulation as being one of a class of problems
in which phases of the problem may be solved in parallel, but the phases themselves
must be performed in sequence, yielding “a sequence of parallel computations”
[CM81]. In this structure, synchronization is required only at phase interfaces. The
termination of a phase manifests itself as a deadlock in the simulation problem, so
that the algorithm for each logical process consists of the iteration of the following
sequence: 1) Parallel Phase - Simulate until deadlock occurs; 2) Phase Interface
- Initiate computation to break the deadlock [CM81].


In the parallel phase, logical processes (or LP’s) behave similarly to the asynchronous
logical processes defined in the Null Message algorithm [CM79], except of
course that they only transmit non-null messages. The communication protocol assumes
bounded buffers of an arbitrary size, and an LP attempting to send a message
will be blocked if the recipient process’s input buffer is full. Chandy and Misra note
that this protocol can be used to ensure that the sum of the memory used by all
of the LP’s is equivalent to that used by a conventional sequential simulation [CM81]
(at some cost to speed-up, naturally).


Deadlock detection is performed in a distributed manner using an algorithm
based on the work of Dijkstra and Scholten in “Termination Detection for Diffusing
Computations” [DS80]. Dijkstra and Scholten’s algorithm deals with a generalized
“diffusing” computation, in which computation in the distributed system is started
by an outside process, the environment, sending a single message to an initiating
process. When a process receives its first message, it is then free to send messages to
other processes. When a process terminates, it sends signals back along the channels
from which it has received messages, constrained by the number of signals it has
received from processes to which it has sent messages. The protocol guarantees
that termination will be detected by a signal sent from the initiating process to
the environment, within a finite number of steps after all processes have terminated
[DS80].
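
The signalling discipline can be illustrated with a simplified Python sketch. It is not the formulation of [DS80]: all traffic passes through one FIFO queue, every process becomes passive immediately after sending, and the names `diffusing_computation`, `sends`, `parent`, and `deficit` are assumptions of this sketch.

```python
from collections import deque


def diffusing_computation(sends, root):
    """sends[p] lists the processes that p messages on its first activation."""
    parent = {}                      # sender whose signal each process is holding back
    deficit = {p: 0 for p in sends}  # messages sent but not yet signalled back
    activated = set()
    queue = deque([("msg", "environment", root)])
    log = []

    def try_release(p):
        # A passive process with no outstanding signals returns its last signal.
        if p in parent and deficit[p] == 0:
            queue.append(("signal", p, parent.pop(p)))

    while queue:
        kind, src, dst = queue.popleft()
        if kind == "msg":
            if dst in parent:
                queue.append(("signal", dst, src))  # already engaged: signal at once
            else:
                parent[dst] = src                   # engaged: hold this signal back
            if dst not in activated:
                activated.add(dst)
                for target in sends[dst]:
                    deficit[dst] += 1
                    queue.append(("msg", dst, target))
            try_release(dst)
        elif dst == "environment":
            log.append("terminated")                # signal from the initiating process
        else:
            deficit[dst] -= 1
            try_release(dst)
    return log, deficit


log, deficit = diffusing_computation({"A": ["B", "C"], "B": ["C"], "C": []}, root="A")
```

The initiating process signals the environment only after every message it sent has been signalled back, which is the property used for detection.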


Misra  and  Chandy’s  termination  detection  scheme  [MC82]  modifies  Dijkstra 
and  Scholten’s  algorithm  to  work  under  the  constraints  of  C.A.R.  Hoare’s  protocol  for 
communicating  sequential  processes  [Hoa78,  Hoa85].  Dijkstra  and  Scholten  assume 
that  processes  may  freely  send  messages  [DS80];  Hoare’s  protocol  adapted  by  Misra 
and  Chandy  assumes  that  a  process  must  wait  to  send  if  the  receiving  process  is  not 
waiting to receive [MC82]. As a result, an idle process cannot determine the cause
of its own idleness unless it first queries its neighbors. The implementation of the
termination detection scheme uses two types of signals: A signals, corresponding to
signals  defined  in  [DS80],  and  B  signals,  used  by  a  process  to  determine  the  waiting 
status  of  its  connected  processes,  and  thus  its  own  status  [MC82]. 

When the controller process (the environment) has received a deadlock detection
notice, it sends a signal to each LP, to compute and output the best lower bound
$W_{ij}$ for the time of next message transmission over each of its output channels $(i,j)$.
This is accomplished in $n$ iterations ($n$ being the number of LP’s), as described in
[CM81]. For each LP$_i$, the iterative computation allows the values $W_{ki}$ for all input
channels $(k,i)$ computed in iteration $m$ to be used in computing $W_{ij}$ during iteration
$m+1$, where $1 \le m < n$.

After computing W for all output channels, each LP must report its resumable
status to the controller. An LP is defined to be resumable if the set of channels it is
waiting on is different from the set that it was waiting on when deadlock occurred.
After reporting to the controller, each LP is free to continue Phase 1 computations
[CM81].

The authors of this algorithm have since proposed alternative methods for
deadlock detection. In [CM83], Chandy and Misra, along with L. Haas, describe a
communications deadlock detection protocol based on the concept of the dependent
set.

The  dependent  set  of  a  process  is  the  set  of  processes  such  that  a  message 
from  a  member  of  the  dependent  set  will  cause  the  process  to  resume  execution. 
Any  process,  upon  finding  itself  idle  relative  to  its  computational  work,  initiates  a 
query  to  all  members  of  its  dependent  set.  If  the  process  receives  replies  equal  to 
the  number  of  queries  it  sent,  then  it  is  deadlocked.  Only  processes  that  are  idle 
take  action  in  response  to  a  query.  When  an  idle  process  receives  a  valid  query,  it 
propagates it to each member of its dependent set. When a process other than the
querying process receives replies from each member of its dependent set, it issues a
reply to the process which queried it. This algorithm is somewhat flexible in that a time-out
may  be  used  to  delay  a  query  computation  until  the  process  has  been  idle  for  some 
arbitrary  time  [CM83]. 
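
The query and reply exchange can be sketched in Python over a fixed snapshot of the system. The function name `is_deadlocked` and its arguments are hypothetical, the dependent sets are assumed not to change while the query computation runs, and the rule that an idle process answers any query after its first one immediately is an assumption of this sketch rather than a statement from the text above.

```python
from collections import deque


def is_deadlocked(initiator, dependent_set, idle):
    """Return True if the initiator receives a reply for every query it sent."""
    if initiator not in idle:
        return False
    pending = {initiator: len(dependent_set[initiator])}  # replies still awaited
    engager = {initiator: None}                           # who queried each process first
    queue = deque(("query", initiator, d) for d in dependent_set[initiator])

    while queue:
        kind, src, dst = queue.popleft()
        if kind == "query":
            if dst not in idle:
                continue                           # active processes take no action
            if dst in engager:
                queue.append(("reply", dst, src))  # not its first query: reply at once
                continue
            engager[dst] = src
            pending[dst] = len(dependent_set[dst])
            if pending[dst] == 0:
                queue.append(("reply", dst, src))  # waits on nobody: reply at once
            for d in dependent_set[dst]:
                queue.append(("query", dst, d))    # propagate to the dependent set
        else:
            pending[dst] -= 1
            if pending[dst] == 0 and engager[dst] is not None:
                queue.append(("reply", dst, engager[dst]))
    return pending[initiator] == 0


waits = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
all_idle = is_deadlocked("P1", waits, idle={"P1", "P2", "P3"})
one_active = is_deadlocked("P1", waits, idle={"P1", "P2"})
```

With P3 still active its query is dropped, so P1 never collects all of its replies and no deadlock is declared.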

Misra has proposed and proven a deadlock detection scheme based on a marker,
a special message that continuously travels a path that takes it once through every
LP in the simulation network [Mis86]. Additional message channels may be required
to create the marker path. In the marker algorithm, any LP that receives the marker
has  the  obligation  to  send  it  on  its  way  when  the  LP  becomes  idle.  Each  LP  maintains 
a  flag  to  signify  whether  or  not  the  LP  has  received  or  sent  any  messages  since  the 
last  visit  by  the  marker.  The  marker  detects  deadlock  if,  with  n  LP’s  in  the  network, 
it  visits  n  consecutive  LP’s  that  have  not  communicated  since  the  marker’s  last  visit. 
The  marker  can  also  be  used  in  deadlock  recovery.  For  that  purpose,  the  marker 
can  record  the  minimum  next  event  time  of  all  the  idle  LP’s  it  has  visited  and  the 
identity  of  the  corresponding  LP.  Upon  detection  of  deadlock,  the  marker  traverses 
to  and  restarts  this  LP  by  advancing  the  LP  clock  to  the  next  event  time.  Markers 
can  also  be  tailored  for  individual  networks.  By  analyzing  the  simulation  network 
prior  to  execution,  additional  markers  can  be  set  up  to  circulate  within  subnetworks 
that  are  prone  to  deadlock  [Mis86]. 
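
The detection rule of the marker algorithm can be illustrated with a short Python sketch. It assumes the marker path is a simple ring and that no LP communicates while the marker circulates; the function `circulate_marker` and its arguments are hypothetical, and the recovery step is omitted.

```python
def circulate_marker(flags, max_hops):
    """Pass a marker around a ring of LPs until deadlock is declared.

    flags[i] is True if LP i has sent or received a message since the marker's
    last visit. Returns the number of hops taken when deadlock is detected, or
    None if it is not detected within max_hops.
    """
    n = len(flags)
    clean_run = 0                # consecutive LPs seen with no communication
    for hop in range(max_hops):
        i = hop % n
        if flags[i]:
            clean_run = 0        # this LP communicated: restart the count
        else:
            clean_run += 1
        flags[i] = False         # the flag is reset as the marker leaves
        if clean_run == n:
            return hop + 1
    return None


hops = circulate_marker([False, True, False, False], max_hops=20)
```

Here LP 1 had communicated before the first visit, so the marker needs part of a second lap before it has seen four consecutive quiet LP’s.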

2.2.5 SRADS Algorithm  P. Reynolds introduced an algorithm for distributed
simulation called the Shared Resource Algorithm for Distributed Simulation (SRADS)
based on the concept of active logical processes [Rey82, Rey83]. In most distributed
simulation methods, logical processes are relatively passive, unable to perform unless
driven by messages sent by other processes. In the SRADS algorithm, when a process
needs to read or write a message, it attempts to access a shared facility, which
is the storage point for messages between a set of two or more LP’s. The sequencing
requirement and associated protocol guarantee that to access the shared facility, an
LP must have the least simulation time of any LP connected to that shared facility.
Each shared facility holds only a single message, so it is possible for a message to be
overwritten without ever being read.

Although  it  is  deadlock-free,  the  SRADS  algorithm  is  limited  in  that  it  assumes 
that  the  processes  in  the  underlying  physical  system  are  synchronized  [Rey82].  The 
assumption  is  that  a  reading  process  “knows”  to  poll  the  shared  facility  for  a  message 
at  a  particular  simulation  time.  If  this  is  not  the  case,  as  in  many  simulations,  an 
LP  may  read  an  outdated  message,  resulting  in  an  incorrect  simulation  [NR84]. 

Reynolds and D. Nicol subsequently proposed an extension to SRADS called
the appointment [NR84], which is a message that a writing process sends to a reading
process, providing the largest known lower bound on the time that a message will
be sent to the shared facility. When the reader reaches the appointment time, it is
blocked until the writer gives it a signal to unblock. This does prevent the reader
from simulating too far and then reading a message from the past, although it is not
clear what, if any, advantages SRADS plus appointments offer over other blocking
algorithms, given its inherent limitations.

2.2.6 Optimistic Algorithms - Time Warp  The distributed simulation algorithms
described above have been characterized as “conservative” [Rey83] or “pessimistic”
algorithms [RMM88]. Conservative algorithms are those which preserve the
proper sequence of events throughout the entire simulation [Rey83]. In other words,
at any time during execution of the simulation, the order in which events have been
simulated at any LP is correct and preserves event dependencies, although the logical
processes have not necessarily simulated the physical system up to the same point in
time. Conservative algorithms can be considered pessimistic, since they imply that if
LP’s were left to run asynchronously, the frequency of event ordering conflicts would
be high, and it is therefore more efficient to prevent such conflicts than to correct
them as they occur.

In 1982, D. Jefferson and H. Sowizral introduced an interesting alternative to
conservative distributed simulation algorithms, the Time Warp algorithm [JS85],
based on the concept of virtual time, which Jefferson analogizes to virtual memory
[Jef85]. In contrast to the blocking-advance mechanisms used in conservative algorithms,
Time Warp includes a local time rollback mechanism for synchronization.
Processes, called objects, execute asynchronously, sending messages without regard
to potential synchronization conflicts. When an existing synchronization conflict
comes to light, a rollback occurs and a portion of the simulation must be re-executed.
Because the algorithm assumes that synchronization conflicts will not occur often,
Time Warp can be considered an optimistic algorithm.

Each Time Warp process maintains a local virtual clock, set to the timestamp
of the last message read by the process. Messages received are placed in an input
queue by increasing order of message timestamp. A process continues to read from its
input queue and calculate output messages until it reads a message with a timestamp
less than its local virtual clock. The process must then roll back its local virtual
clock to the value of the newly-received message timestamp. This implies that all
messages already sent to other processes during the rolled-back period must somehow
be cancelled, and all events re-simulated with the addition of the new message. Local
virtual time cannot, then, be considered a measure of actual progress in a simulation,
as it may be rolled back to a previous point [JS85].

Re-simulating an interval is simple, in that recently input messages and previous
values of the state variables are maintained by each process. Message cancellation
is accomplished through the use of antimessages. When a process outputs a message,
it saves a copy for possible later use as an antimessage. If a message must
be cancelled due to rollback, the rolled-back process need only send the corresponding
antimessage. If a matching message and antimessage meet in the input queue of a
process, they are mutually annihilated, cancelling the message. If the message has
already been processed, receipt of the antimessage causes the process to roll back
to the time of the message, which is removed from the input queue, and processing
continues. Of course, this rollback may spawn antimessages itself, so that a single
rollback may initiate a “ripple” of rollbacks and antimessages through the simulation
[Jef85].
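
Rollback and antimessage handling for a single process can be sketched in Python as follows. This is a toy model under stated assumptions, not the Time Warp implementation of [JS85]: state saving is omitted, each processed message is assumed to produce exactly one output stamped 1.0 later, an antimessage is assumed to arrive after the message it cancels, and the class `TimeWarpObject` with all of its fields is hypothetical.

```python
class TimeWarpObject:
    """Toy Time Warp process; message ids and the +1.0 output delay are illustrative."""

    def __init__(self):
        self.lvt = 0.0          # local virtual time
        self.input_queue = []   # unprocessed (timestamp, msg_id), kept sorted
        self.processed = []     # messages already simulated, in timestamp order
        self.sent = []          # (send_time, output) copies kept for cancellation
        self.antimessages = []  # antimessages emitted by rollbacks

    def _rollback(self, timestamp):
        # Return every message simulated at or after the timestamp to the input queue.
        undone = [m for m in self.processed if m[0] >= timestamp]
        self.processed = [m for m in self.processed if m[0] < timestamp]
        self.input_queue = sorted(self.input_queue + undone)
        # Cancel every output produced during the rolled-back interval.
        self.antimessages += [out for t, out in self.sent if t >= timestamp]
        self.sent = [(t, out) for t, out in self.sent if t < timestamp]
        self.lvt = self.processed[-1][0] if self.processed else 0.0

    def receive(self, timestamp, msg_id, anti=False):
        msg = (timestamp, msg_id)
        if anti:
            if msg in self.processed:
                self._rollback(timestamp)  # already simulated: undo it first
            self.input_queue.remove(msg)   # message and antimessage annihilate
            return
        if timestamp < self.lvt:
            self._rollback(timestamp)      # straggler from the virtual past
        self.input_queue = sorted(self.input_queue + [msg])

    def run(self):
        while self.input_queue:
            timestamp, msg_id = self.input_queue.pop(0)
            self.lvt = timestamp
            self.processed.append((timestamp, msg_id))
            self.sent.append((timestamp, (timestamp + 1.0, "out-" + msg_id)))


obj = TimeWarpObject()
obj.receive(1.0, "a")
obj.receive(3.0, "b")
obj.run()              # local virtual time is now 3.0
obj.receive(2.0, "c")  # straggler: rolls back to before time 2.0
obj.run()              # "c" and then "b" are (re-)simulated
```

The straggler stamped 2.0 forces the output generated by "b" to be cancelled with an antimessage before "b" is simulated again.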

Processes would be subject to rollback to the beginning of the simulation and
would therefore have to store their entire message histories, if it were not for Global
Virtual Time. Global Virtual Time is calculated as the minimum of all local virtual
clocks and the send timestamps of all messages that have been sent but not processed.
Events that happened prior to Global Virtual Time are irrevocably committed, so
that processes must only maintain their message histories from Global Virtual Time
onward. Note that Global Virtual Time can be used as a measure of a simulation’s
progress, as it is non-decreasing [Jef85].
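
The Global Virtual Time calculation and the pruning of histories it permits can be written as a small Python sketch. The function names and the sample values are illustrative assumptions, and the sketch presumes a consistent snapshot of all clocks and unprocessed messages is available, which is the hard part in a real distributed system.

```python
def global_virtual_time(local_clocks, unprocessed_send_times):
    """Minimum over all local virtual clocks and all sent-but-unprocessed messages."""
    return min(list(local_clocks) + list(unprocessed_send_times))


def prune_history(history, gvt):
    """Keep only saved (timestamp, item) entries from GVT onward."""
    return [(t, item) for t, item in history if t >= gvt]


# Hypothetical snapshot: three processes and two messages not yet processed.
gvt = global_virtual_time([12.0, 9.5, 15.0], [8.0, 20.0])
kept = prune_history([(5.0, "s0"), (8.0, "s1"), (11.0, "s2")], gvt)
```

The unprocessed message stamped 8.0 holds Global Virtual Time below every local clock, since it could still force a rollback to that time.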

Time  Warp  has  several  advantages.  As  an  algorithm  it  is  relatively  simple  and 
quite  elegant.  Time  Warp  doesn’t  require  that  messages  be  received  by  a  process  in 
any  order  to  maintain  correctness  of  the  simulation.  Processes  may  be  created  and 
destroyed  in  the  course  of  the  simulation. 

This flexibility does come at a cost, however. The memory requirements of
Time Warp are quite high. All messages generated since Global Virtual Time are
stored twice - once in the sending process and once in the receiving process. This
requirement for memory may be traded off against the cost of computing Global
Virtual Time more frequently. Finally, there is the computational time wasted by
rolling processes back. Jefferson and Sowizral claim that the amount of processing
time that a process spends on rollbacks in Time Warp would be spent blocked in an
equivalent simulation using conservative algorithms [JS85]. It seems, though, that
the processor hosting a blocked process could still perform useful simulation support
computations. Processing time spent on rolled-back events is simply lost.

2.3 Performance Studies of Distributed Simulation

A decade has elapsed since the first algorithms for distributed simulation were
published, and there is still only a small knowledge base derived from empirical testing
of these algorithms. This is to a certain extent attributable to the lack of available
multiprocessor computing systems for research use. To partially overcome this handicap,
some performance studies of discrete-event simulation have been accomplished
using uniprocessor systems, ironically, to simulate the operation of distributed
simulation systems [CM79, Ree85]. The large number of factors affecting the performance
of distributed simulation has also hindered the development of heuristics to guide
the use of distributed algorithms for simulation.


2.3.1 Performance Measurement  The primary metric used to measure the
effectiveness of distributed simulation is the speed-up factor. Speed-up conveys the
throughput advantage of performing a computation over multiple processors. Speed-up
is often expressed as a function of the number of processors, and is said to be
linear if Speed-up increases linearly with the number of processors. The formula
for Speed-up factor is $T_1/T_n$, where $T_n$ is the time to execute the simulation over
$n$ processors. There are, however, multiple definitions for $T_1$ in use in distributed
simulation literature.

If one is interested in the absolute speed advantage gained by a distributed
simulation over a sequential algorithm such as the event list, then $T_1$ would be defined
as the time to execute the simulation using the sequential implementation [Fuj88].
If one wishes to emphasize the marginal effect of each additional processor on the
performance of a given distributed algorithm, then $T_1$ can be defined to be the time
to execute the distributed algorithm on a system of one processor [RM88, RMM88].
Both of these definitions can be commonly found in the literature.

Although Speed-up is the metric of primary concern, in certain cases, intermediate
metrics may be desirable to quantify a certain effect that may impact the
attainable Speed-up. Fujimoto defines the null message ratio to be the number of
null messages processed in an implementation of the Null Message algorithm divided
by the number of “real” messages processed. The deadlock ratio is the number of
messages processed divided by the number of deadlocks that occur in a deadlock
detection and recovery algorithm.
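
The three metrics are simple ratios, shown here as Python functions for concreteness. The function names are illustrative and the sample measurements are invented for the example; they are not results from any of the cited studies.

```python
def speed_up(t1, tn):
    """Speed-up factor T1/Tn; T1 follows whichever definition the study adopts."""
    return t1 / tn


def null_message_ratio(null_messages, real_messages):
    """Null messages processed per real message processed."""
    return null_messages / real_messages


def deadlock_ratio(messages_processed, deadlocks):
    """Messages processed per deadlock in a detection and recovery algorithm."""
    return messages_processed / deadlocks


# Made-up measurements, for illustration only.
s = speed_up(t1=120.0, tn=20.0)
nmr = null_message_ratio(null_messages=300, real_messages=1200)
dr = deadlock_ratio(messages_processed=5000, deadlocks=25)
```

A low null message ratio and a high deadlock ratio both indicate that little of the processing went into synchronization overhead.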

2.3.2 Factors Affecting Performance  The factors that may affect the performance
of a distributed simulation algorithm are extremely diverse, arising from the
nature of the system being simulated, the model constructed of the system, and
the hardware and software environment under which the distributed simulation is
performed.

2.3.2.1 Structure of the Simulated System  Early distributed simulation
studies recognized that the structure of the simulated system has a significant effect
on performance [PWM79a]. The topology of interconnections between LP’s has
been studied as a variable in almost every study of distributed simulation to date. A
related factor, the routing probability of messages, has been shown to affect performance
in some algorithms [PWM79a]. The distribution of the message timestamp
increment, which is the difference in the timestamp value of a message input by
a PP and the timestamps of the resulting output messages, has been shown to be
significant. The value of lookahead, the lower bound on timestamp increment, has
been shown to be especially significant in the performance of distributed simulation
algorithms [Fuj88].

2.3.2.2 Simulator Workload  The amount of computation performed by
an LP between message input and output, also called the CPU grain, has been shown
to positively correlate to the resulting speed-up [Fuj88, DR83], although the effect
of varying the grain appears to be interrelated with the effects of topology [RM88].

Also of great interest is the workload balance, which seeks to even out the
computational load among the available processors. Workload balancing is a topic
of interest within the distributed processing community because better workload balance
implies higher throughput, excluding the costs of interprocessor communication
(IPC). Chu et al. describe a saturation effect in which the overhead of IPC increases to
the point that throughput decreases with the application of each additional processor
to the problem [CH*80]. Hence, in the decision process of partitioning a task and
allocating the sub-tasks to processors, a trade-off exists between workload balance
and IPC. Chu et al. note that finding an optimal allocation of tasks is computationally
complex, and that it is generally more efficient to use a sub-optimal allocation
guided by heuristics [CH*80].


Machine architecture partially determines methods available for task allocation.
In shared-memory systems, either static allocation, where tasks are permanently
assigned to processors, or dynamic allocation, where idle processors take work
from a queue of processes, can be used [RM88]. In loosely-coupled message-passing
systems, static allocation is used; dynamic allocation is considered impractical due
to communication overhead.

2.3.2.3 Implementation Factors  The software implementation (e.g. the
compiler  used,  the  efficiency  of  the  code)  and  the  machine  architecture  (e.g.  the 
amount  of  memory,  the  communication  overhead,  the  existence  of  shared  memory) 
need  to  be  controlled,  and  their  effects  normalized,  before  direct  comparisons  can 
be  made  among  the  results  of  performance  studies  of  distributed  simulation.  Failure 
to  adequately  do  so  may  result  in  comparative  judgements  made  on  the  basis  of 
implementation  factors,  rather  than  on  the  merits  of  the  respective  algorithms.  The 
ability  to  normalize  the  results  of  previous  performance  studies  eliminates  the  need 
to  replicate  these  studies  in  comparatively  evaluating  some  new  method. 

2.3.3 Chronology of Performance Studies  Early performance studies of distributed
simulation were typically concerned with demonstrating the feasibility of
some particular algorithm. Peacock et al. [PWM79a] performed experiments with
their version of the Null Message algorithm on a network of microcomputers. Their
results indicate that the topology of the simulation model is a major factor in the
algorithm’s performance, and that topologies with many cycles may suffer from poor
performance. This study was limited in scope; only a few simple topologies were
distributed over different numbers of nodes; no other parameters were varied. Similar
experiments, described in [CM79], were conducted by M. Seethalakshmi using a
uniprocessor simulation of a distributed system. These studies show that the Null
Message algorithm may not be suitable for simulations of all topologies. In particular,
Seethalakshmi’s results appear to have motivated the development of Chandy
and Misra’s Deadlock Detection and Recovery algorithm [CM81].

D.  Reed,  in  [Ree85],  performed  a  uniprocessor  simulation  of  several  topologies 
with  varying  populations  of  messages,  using  both  the  Null  Message  [CM79]  and 
Chandy-Misra  Deadlock  Detection  and  Recovery  [CM81]  algorithms.  His  results 
supported  the  findings  of  [PWM79a]  concerning  the  effect  of  feedback  cycles  on 
the performance of the Null Message algorithm. Low message populations were
also  implicated  in  causing  poor  performance  of  the  Null  Message  algorithm.  The 
conclusion  drawn  was  that  deadlock  detection  and  recovery  would  engender  less 
overhead  than  deadlock  avoidance,  especially  where  there  is  much  feedback  and  a 
small  message  population. 

In  more  recent  research,  R.  Fujimoto  evaluated  the  performance  of  the  Null 
Message  or  “Deadlock  Avoidance”  and  Deadlock  Detection  and  Recovery  algorithms 
on  a  BBN  Butterfly  shared  memory  multiprocessor  under  a  variety  of  controlled 
conditions  [Fuj88].  A  symmetric  toroid  topology  with  64  and  16  logical  processes 
was used for the experiments. The experiments assumed that a good workload
balance  could  be  found,  so  that  in  most  experiments,  symmetric  logical  processes 
were  used. 

Important  performance  considerations  were  shown  by  Fujimoto  to  include  the 
quality  of  the  lookahead  function  for  timestamp  incrementation  in  the  deadlock 
avoidance  algorithm,  and  a  sufficiently  large  message  population  in  the  case  of  the 
deadlock detection and recovery algorithm. In the latter, an “avalanche effect” was
found, such that performance dramatically improved when message traffic reached a
certain level. An increase in the message population required to start the avalanche
effect resulted from introducing a process with an asymmetric timestamp distribution
function. Otherwise, asymmetry did not significantly affect performance. Dynamic
process  allocation  was  possible  because  of  the  shared  memory  architecture  of  the 
simulation  testbed,  but  was  generally  shown  to  hurt  performance  compared  to  the 
static  allocation,  most  likely  due  to  the  “perfect”  load  balancing  in  the  simulator 
[Fuj88]. 

D.  Reed,  A.  Malony,  and  B.  McCredie  analyzed  the  performance  of  both 
the  Deadlock  Avoidance  and  the  Deadlock  Detection  and  Recovery  algorithms  on 
a Sequent Balance 21000 shared-memory multiprocessor [RMM88]. Their experiments
evaluated distributed simulations of several queueing network topologies using
the RESQ paradigm for queueing models, including simple tandem, feed-forward
(forked), cyclic, central server, and cluster (nested feedback) networks. Reed et al.
point out that studies of simple topologies do not reflect typical simulation
problems, while complex topologies, though realistic, increase the difficulty in finding the
causes of poor performance in a distributed simulation. In this study, the effects
of  dynamic  node  allocation  were  studied  in  combination  with  a  variety  of  message 
workloads. 

The results of this study were significantly more negative than those of Fujimoto
[Fuj88]. Speed-up approaching N was not achieved even in the tandem case
with more than a few nodes. This is surprising, especially considering that the
speed-up factor was calculated using the distributed algorithm executed on one processor
(which  invariably  takes  more  time  than  an  event  list  implementation)  as  baseline, 
yielding  an  artificially  high  speed-up.  A  possible  explanation  given  was  that  some  of 
the  poor  performance  may  be  due  to  machine  architecture  constraints  such  as  bus 
and  memory  contention.  Performance  was  said  to  have  improved  noticeably  when 
PP’s  were  dynamically  assigned  to  processors  for  the  cluster  topology.  Assumptions 
that  may  have  affected  the  results  include  limited  lookahead  in  conjunction  with  the 
deadlock  avoidance  algorithm,  and  no  a  priori  knowledge  of  the  simulation  network. 

Reed and Malony conducted further studies in which the amount of computation
at each node, the “spin loop,” was varied to observe the effects on central
server and cluster network topologies, considered to be the most representative of
typical simulation models [RM88]. The central server model distributed with deadlock
recovery showed no effect from increased spin loop, because the presence of
the “bottleneck” central server led to regularly occurring deadlock. In contrast, the
same topology with deadlock avoidance yielded performance improvement when the
spin interval was increased. Simulating the cluster network topology with deadlock
recovery showed, as with the central server, the relative independence of speed-up
and spin interval. In a surprising result, the cluster model with deadlock avoidance
suffered decreased speed-up as spin interval was increased, in direct opposition to
the central server model results. This can be explained when one considers the large
number of forks in the cluster network, and that large time intervals between message
transmissions lead to a “decoupling” of the LP's, and increased Null message
overhead. In this case, supposedly distinct factors, topology and spin interval, were
actually interrelated in their effects on distributed simulation performance [RM88].

2.4 Summary

The majority of the algorithms for distributed simulation have been in existence
since the late 1970's and are therefore quite mature; most are provably correct
and relatively straightforward. Distributed simulation algorithms have been divided
into two categories: conservative and optimistic algorithms. Performance studies
have been conducted on both types of algorithms, considering many of the factors
that are expected to significantly affect performance. Some research data is available
to be used in the formation of heuristics for applying these algorithms. Additional
performance studies are required, attempting to normalize the effects of the
implementation factors, so that algorithm performance can be estimated and compared
over a wide range of conditions.


III.  Description  of  Algorithms 


The  algorithms  under  study  are  distributed  simulation  algorithms  built  on  a 
paradigm  of  physical  systems  based  on  the  message-passing  paradigm  of  Chandy 
and  Misra,  while  incorporating  concepts  of  traditional  event-oriented  simulation. 
The physical system may be simulated by a distributed logical system using the
conservative synchronization algorithm proposed by Chandy and Misra [CM79], slightly
modified  with  a  prediction  function,  and  utilizing  an  event  list,  similar  to  that  used 
in  traditional  sequential  simulation,  at  each  Logical  Process  (LP). 

These  algorithms  are  of  interest  because  it  may  often  be  necessary  or  desirable 
to  simulate  a  system  of  N  components  or  Physical  Processes  (PP's)  with  a  distributed 
computing system consisting of fewer than N processors. Extending the
message-passing paradigm of Chandy and Misra to create multiple processes within a single
processor  introduces  the  problem  of  process  scheduling,  along  with  complicating 
deadlock  resolution.  The  goal  is  to  find  an  efficient  method  for  simulating  multiple 
PP’s  within  a  single  LP. 

A  natural  candidate  algorithm  for  simulating  a  complex  physical  process  within 
a  single  logical  process  is  the  event  list  method  used  in  sequential  simulation.  The 
linear  nature  of  the  event  list  with  respect  to  time  ensures  that  the  events  within 
each  LP  are  simulated  in  strictly  chronological  order.  The  chronological  ordering 
provided by a monolithic event list, combined with the interprocessor synchronization
method of Chandy and Misra, guarantees that event dependencies are preserved. The
Chandy-Misra Null Message algorithm is easily extensible to accommodate an event
list implementation at each logical process.

This chapter presents the message-passing paradigm of physical systems
proposed by Chandy and Misra [CM79], modifies it to conform to an “event-oriented”
view, and applies the resulting paradigm to a logical system for distributed
discrete-event simulation. The problem of deadlock in the logical system and the use of Null
Messages for deadlock avoidance are discussed. Alternative strategies for sending
Null messages are presented. These variants, which may enhance the speed of the
logical system's execution under certain circumstances, will be evaluated in
Chapter 4.


3.1  The  Physical  System 

Chandy  and  Misra  have  defined  a  paradigm  of  physical  systems  based  on  the 
concept  of  message-passing  PP’s  in  [CM79].  While  this  paradigm  maps  well  into  a 
logical  system  for  distributed  simulation,  it  does  not  capture  all  aspects  of  the  time 
and  state  relationships  in  a  physical  system  that  may  be  of  interest.  To  do  so,  the 
Chandy-Misra  physical  system  paradigm  can  be  reconciled  with  some  “traditional” 
concepts of discrete-event system modelling. While the message-passing model of
physical processes seems to obviate concepts such as events, for example, events can
be defined implicitly in terms of the state changes of individual physical processes.
The concept of the message can be restated in terms of event dependencies among
PP's [Mis86].


3.1.1  Definitions  It  has  been  observed  by  R.  Nance  that  the  evolution  of 
simulation  has  resulted  in  the  existence  of  many  differing  paradigms  of  physical 
systems, such as event-oriented and process-oriented views, and “the inability to
generalize the simulation modelling task” [Nan81]. This state of affairs has, in turn,
resulted in considerable imprecision in the terminology of simulation. Nance has
proposed a set of basic definitions to clarify the time and state relationships in the
simulation of physical systems [Nan81]. The following definitions are incorporated
into  the  description  of  the  physical  system: 


•  An  Object  is  anything  that  can  be  characterized  by  one  or  more  Attributes  to 
which  values  are  assigned. 

•  A  System  is  comprised  of  objects  and  relationships  among  objects. 

•  The  State  of  an  Object  is  the  enumeration  of  all  attribute  values  of  that  object 
at  a  particular  instant. 

•  An  Event  is  a  change  of  object  state,  occurring  at  an  instant,  that  initiates  an 
activity  precluded  prior  to  that  instant. 

• An event is Determined if the only condition on event occurrence can be
expressed strictly as a function of system time. Otherwise, the event is Contingent.

• An Object Activity is the state of an object between two events describing
successive state changes for that object.

3.1.2  The  Physical  Process  In  describing  their  Null  Message  algorithm,  Chandy 
and  Misra  define  a  model  of  physical  systems  based  on  the  concept  of  communicating 
physical processes [CM79]. A simulated system consists of some finite number N of
physical processes (or PP's) which represent the components of an actual physical
system. The PP's communicate with each other exclusively through messages. A
message in the physical system can be described as a tuple of the form $(t, m)$, where
$t$ is the time of transmission and $m$ is the message content. If $PP_i$ sends messages
to $PP_j$, there exists a message arc [CM79] $(i, j)$ between $PP_i$ and $PP_j$. Messages are
transmitted instantaneously.

In the physical process paradigm, each PP is described in terms of messages
sent and messages received. The sequence of messages transmitted over arc
$(i, j)$ is of the form $s_{ij} = \{(t_1, m_1), (t_2, m_2), \ldots, (t_K, m_K)\}$, where $K$ is the finite number
of messages transmitted over $(i, j)$ during the simulation period $[0, Z]$. Note that
$0 < t_1 < t_2 < \cdots < t_K < Z$, reflecting the chronological order of message transmission.

Chandy and Misra define the message history of arc $(i, j)$ at time $t$, $h_{ij}(t)$, as the
sub-sequence of $s_{ij}$ denoting the message tuples transmitted over arc $(i, j)$ up to and
including time $t$. The state of $PP_i$ at time $t$ is represented by a set of state variables
$\Psi_i(t)$. In the Chandy-Misra paradigm, the existence of some function $A_i$ is assumed,
such that:

$$\Psi_i(t^+) = A_i(\Psi_i(t), INFORM_{ki}(t)\ \forall k) \qquad (3.1)$$

where $t^+$ is the instant in time directly after time $t$, and $INFORM_{ki}(t)$ is the
information sent across arc $(k, i)$ at time $t$, possibly NULL (signifying no event), or
an event message $(t, m)$ [CM79]. By applying 3.1 over an interval $[t', t)$ where $t' < t$,
the function $B_i$ can be derived [CM79]:

$$\Psi_i(t) = B_i(\Psi_i(t'), h_{ki}(t', t)\ \forall k) \qquad (3.2)$$

where $h_{ki}(t', t)$ is the sequence of messages over $(k, i)$ in the time interval $[t', t)$. The
existence of $B_i$ means that the state of $PP_i$ at time $t$ can be computed by knowing
the state of $PP_i$ at some previous time, and all messages that have been received
in the interval [CM79].
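
The idea that a later state can be computed from an earlier state and the messages received in between can be sketched as a fold over the message sequence. The Python below is illustrative only: the integer queue-length state, the `arrive` and `depart` message contents, and the helper names `transform` and `state_at` are assumptions introduced for this sketch and do not appear in [CM79].

```python
from functools import reduce

def transform(state, message):
    """Assumed one-step update: apply a single (t, m) message to the state."""
    _, content = message
    if content == "arrive":
        return state + 1
    if content == "depart":
        return max(state - 1, 0)
    return state  # NULL or unrecognised content leaves the state unchanged

def state_at(prev_state, t_prev, t, received):
    """State at time t, given the state at t_prev and the messages in [t_prev, t)."""
    window = sorted(msg for msg in received if t_prev <= msg[0] < t)
    return reduce(transform, window, prev_state)

# Hypothetical message sequence received by one PP, as (time, content) tuples.
received = [(1.0, "arrive"), (2.5, "arrive"), (4.0, "depart"), (7.0, "arrive")]
state_at_5 = state_at(0, 0.0, 5.0, received)
state_at_8 = state_at(state_at_5, 5.0, 8.0, received)
```

Computing the state at time 8 directly from time 0 gives the same value as going through the intermediate state at time 5, which is the property that makes the interval form useful.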

In certain instances it is possible to predict with certainty the messages sent
by a $PP_i$ up to some future time $\hat{t}$ given $h_{ki}(t)$ for all $k$. For example, consider a
single-server, single-queue queueing process with one input arc corresponding to a
service arrival, and one output arc corresponding to a service completion. With a
constant inter-service time of 10 with no pre-emption, the output of the process can
be deduced up to time $t + 10$, where $t$ is the time that a job entity begins service. The
value $\hat{t} - t$, associated with each arc in the physical system, is called the Lookahead
function at time $t$ for arc $(i, j)$, or $L_{ij}(t)$ [CM79]. Lookahead is computed with the
following function, which is also assumed to be computable [CM79]:

$$L_{ij}(t) = F_{ij}(\Psi_i(t), INFORM_{ki}(t)\ \forall k) \qquad (3.3)$$




Lookahead value $L_{ij}$ provides the bound on prediction of the future output of $PP_i$
over $(i, j)$, calculated by function $C_{ij}$:

$$h_{ij}(t + L_{ij}(t)) = C_{ij}(\Psi_i(t), INFORM_{ki}(t)\ \forall k) \qquad (3.4)$$

which  is  assumed  to  be  computable  [CM79]. 
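
The single-server example can be made concrete with a small sketch of a lookahead computation. This is a simplified illustration rather than the function assumed in [CM79]: the names `SERVICE_TIME` and `lookahead` are hypothetical, and the treatment of an idle server (no departure can occur sooner than one full service time from now) is an added assumption.

```python
SERVICE_TIME = 10.0  # constant service time taken from the example in the text

def lookahead(now, service_start):
    """Length of future time over which the server's output is already determined.

    service_start is the time at which the job in service began, or None when
    the server is idle (an assumption of this sketch).
    """
    if service_start is None:
        # Idle: an arrival at `now` or later cannot depart before now + SERVICE_TIME.
        return SERVICE_TIME
    completion = service_start + SERVICE_TIME
    return max(completion - now, 0.0)

at_start = lookahead(0.0, 0.0)      # job has just begun service
mid_service = lookahead(3.0, 0.0)   # three time units into the service
```

The value returned for a busy server is a conservative bound: jobs already waiting in the queue would allow a longer prediction, which this sketch does not attempt.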

For each $PP_i$, there exists a set of points in time in the interval $[0, Z]$ such
that $\Psi_i(t^-)$ is different from $\Psi_i(t^+)$, where $t^-$ is the instant in time preceding time $t$.
These points in time are events, in accordance with the above definition. Taking
the set of instants in time at which $\Psi_i$ changes and the values of $\Psi_i$ at those times
yields the state “trajectory” of $PP_i$, the relation between $\Psi_i$ and simulation time $t$,
$STRAJ(\Psi_i(t), t)$, where $0 < t < Z$ [Zei76].

By 3.1, the receipt of a message at $PP_i$ is an event at $PP_i$. It is apparent
from 3.4 that a message transmission at $PP_i$ also reflects a change in $\Psi_i$. Instants
in time where changes in $\Psi_i$ occur are not explicitly defined in the message-passing
paradigm unless they are associated with a message transmission or reception.

3.1.3 Event-Oriented View of the Physical Process In the message-oriented
paradigm discussed above, the state of $PP_i$, $\Psi_i$, is a function of its initial value
and the messages received by $PP_i$ over the simulation period. While this is a valid
paradigm of physical systems, this view does not emphasize the changes in $\Psi_i$ over
time. In the event-oriented view, changes in $\Psi_i$, i.e., events, have primacy. The
physical process can be defined in terms of events as follows:




Physical process $i$ is described as the tuple $(M_i, E_i, \Psi_i, N_i, B_i, D_i)$,

where $M_i$ is the set of incoming message event types
      $E_i$ is the internal event type
      $\Psi_i$ is the state variable set for $PP_i$
      $N_i$ is the internal event generating function
      $B_i$ is the set of state transform functions
      $D_i$ is the message generating function

3.1.3.1 External Events A message received at $PP_i$ from $PP_k$ is an
event of type $M_{ki}$, associated with message channel $(k, i)$, where $M_{ki} \in M_i$. Such
a message transmitted at time $t$ in the physical system is described by the tuple
$(t, m_{ki})$, abbreviated $m_{ki}(t)$. For each $M_{ki} \in M_i$, $\exists B_{ki} \in B_i$, providing a mapping
such that $B_{ki} : \Psi_i \times M_{ki} \to \Psi_i$. Given a message $m_{ki}(t)$ and $\Psi_i(t)$, the values of
the state variables of $PP_i$ at time $t$, then:

$$\Psi_i(t^+) = B_{ki}(\Psi_i(t), m_{ki}(t)) \qquad (3.5)$$

where $t^+$ is the instant in time subsequent to $t$. Note that $B_{ki}$ is a computable
function from the assumption made previously that 3.1 is computable.

3.1.3.2 Contingency and Time Advance A message event received at
$PP_i$ initiates an activity of some duration (possibly infinite), during which $\Psi_i$ is
invariant. It may be possible to calculate the value $ta$, the length of time
that $PP_i$ will remain in $\Psi_i(t)$ if no external events occur (the “time advance” value)
[Zei76]. The value is calculated with function $A_i$, where:

$$ta = A_i(\Psi_i(t)), \text{ where } ta > 0 \qquad (3.6)$$

If $ta$ is not defined, then $\Psi_i(t)$ is a passive state. Note that $A_i$ is a computable function
if 3.2 is computable, which is assumed, taking $h_{ki}(t, Z) = NULL$ for all $k$. Time
$\hat{t} = t + A_i(\Psi_i(t))$, the simulation time of the next scheduled internal event at $PP_i$,
can then be computed.

Chandy and Misra do not permit any $PP_i$ to send event messages to itself, since
“that effect can be achieved by a process looking at its own computation” [CM79],
presumably by applying 3.4 over all output lines $(i, j)$ to determine the next event
message sent by $PP_i$, and then using 3.2 to update $\Psi_i$. However, this procedure
hides the true nature of the state change. The time and state relationship is made
explicit by computing, at any simulation time $t$, an internal event of type $E_i$, $e_i(\hat{t})$,
where $\hat{t} = t + A_i(\Psi_i(t))$, $t < \hat{t}$. If 3.2 is indeed computable, as is assumed, then $e_i(\hat{t})$
is derivable from $\Psi_i(t)$. The computable function $N_i$ can be defined such that:

$$e_i(\hat{t}) = N_i(\Psi_i(t), \hat{t}) \qquad (3.7)$$

If $\Psi_i(t)$ is a passive state, then $N_i(\Psi_i(t), \hat{t}) = (Z, NULL)$ for any $\hat{t} > t$.
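
The relationship between the time advance value and the next internal event can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary-based state, the constant `Z`, the event content `"service_complete"`, and the helpers `time_advance` and `next_internal_event` are all hypothetical, and the second helper takes the current time and computes the event time itself, which simplifies the signature used in the text.

```python
Z = 100.0  # assumed end of the simulation period [0, Z]

def time_advance(state):
    """Assumed time advance: how long the PP stays in `state` if undisturbed.

    Returns None for a passive state, where no internal event is pending.
    """
    return state["remaining"] if state["busy"] else None

def next_internal_event(state, now):
    """Assumed internal event generator, returning a (time, content) tuple."""
    ta = time_advance(state)
    if ta is None:
        return (Z, None)  # passive state: the NULL event at the end of the period
    return (now + ta, "service_complete")

busy_event = next_internal_event({"busy": True, "remaining": 10.0}, 5.0)
passive_event = next_internal_event({"busy": False, "remaining": 0.0}, 5.0)
```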

If $\Psi_i(t)$ is not passive, then the next internal event $NEV_i(t) = N_i(\Psi_i(t), \hat{t})$ can
be scheduled on a contingent basis. Whether the event actually occurs is dependent
upon event messages received in the interval $(t, \hat{t})$. If $NEV_i(t) = e_i(\hat{t})$ and message
$m_{ki}(t')$ arrives at $PP_i$, $t < t' < \hat{t}$, then $\Psi_i(t'^{+}) = B_{ki}(\Psi_i(t'), m_{ki}(t'))$ and the new next
internal event, $NEV_i(t')$, is recomputed with $N_i$ from $\Psi_i(t'^{+})$ (it may be the same as
$NEV_i(t)$). If $NEV_i(t) = e_i(\hat{t})$, then $\Psi_i(\hat{t}^{+})$ is calculated by:


(3.8) 


3.1.3.3 Simultaneous Events Multiple event messages arriving at $PP_i$
can lead to the occurrence of simultaneous events at $PP_i$, as can an event message
arriving at the same instant as a scheduled activity-ending internal event. The
treatment of simultaneous events that affect the same set of state variables has not
received a great deal of attention in simulation literature. In most applications
of discrete-event simulation, simultaneous events are either independent or
dependencies can be safely ignored [Par69]. This is not always the case, however, and
in  attempting  to  formulate  a  general  paradigm  for  simulation,  this  issue  must  be 
addressed. 

The concept of transitory states is introduced from [Zei76]. These are states
between two events, such that the activity delineated by those events has zero
duration. A transitory state at time $t$ at $PP_i$ is $\Psi_i^n(t)$, where $n$ is the number of events
that have already occurred at $PP_i$ at time $t$. With $N$ events $E_n(t)$, $n = 1, \ldots, N$,
occurring at time $t$, where $E_n \in M_i \cup \{E_i\}$, $\Psi_i(t)$ will take on $N$ transitory states:

$$\Psi_i^0(t) \xrightarrow{E_1(t)} \Psi_i^1(t) \xrightarrow{E_2(t)} \cdots \xrightarrow{E_N(t)} \Psi_i^N(t)$$

and $\Psi_i^N(t) = \Psi_i^0(t^+)$.

Note: while $\Psi_i$ is no longer a function strictly of time, the function argument
notation will be used for consistency to identify the time reference of a value of $\Psi_i$.

denotes  the  state  of  PP,  subsequent  to  a  which  is  either  at  time  t  or 

time  t+. 

Events $m_{ki}(t)$ and $m_{gi}(t)$ are said to be independent if:

$$B_{gi}(B_{ki}(\Psi_i(t), m_{ki}(t)), m_{gi}(t)) = B_{ki}(B_{gi}(\Psi_i(t), m_{gi}(t)), m_{ki}(t))$$

In other words, the same $\Psi_i$ will result from simulating the two events in arbitrary
order. If these events are not independent, an ordering relation at $PP_i$, $ORDER_i$,
is included in the description of $PP_i$, such that if $(m_{ki}(t), m_{gi}(t))$ is in $ORDER_i$, then
$m_{ki}(t)$ must be simulated before $m_{gi}(t)$ to yield a correct result. Failure to consider
this possibility may result in the equivalent of a “race” condition in digital logic
[Par69]. Unfortunately, it will be seen that the distributed event list algorithm
cannot prevent this condition from occurring in many instances.
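
Independence of two simultaneous events amounts to their state transforms commuting on the current state. The sketch below tests this for one given state; the integer state, the event names `add` and `double`, and the helpers `apply_in_order` and `independent` are assumptions made for illustration only.

```python
def apply_in_order(state, events, transforms):
    """Apply the state transform of each event in the given order."""
    for kind in events:
        state = transforms[kind](state)
    return state

def independent(state, first, second, transforms):
    """True if both simulation orders of the two events yield the same state."""
    forward = apply_in_order(state, [first, second], transforms)
    backward = apply_in_order(state, [second, first], transforms)
    return forward == backward

# Hypothetical transforms acting on an integer state.
transforms = {
    "add": lambda s: s + 1,
    "double": lambda s: s * 2,
}
commuting = independent(3, "add", "add", transforms)
order_sensitive = independent(3, "add", "double", transforms)
```

Here two `add` events commute, while `add` followed by `double` gives 8 and the reverse order gives 7, so that pair would need an entry in the ordering relation. The check covers a single state only; independence in general would have to hold for every reachable state.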

3.1.4 Message-Passing A message in the physical system at time $t$ can be
viewed as a manifestation of the dependency relation, $d$, over the set of all events in
the physical system, such that if $d(E_x, E_y)$, the occurrence of event $E_x$ implies the
occurrence of $E_y$. This relation can be defined over the subdomain $\{E_i\} \times M_j$ for
every message arc $(i, j)$ by some function $D_{ij} \in D_i$, where:

$$m_{ij}(t) = D_{ij}(\Psi_i(t), e_i(t)), \quad D_{ij} \in D_i \qquad (3.9)$$
If a message $m_{ij}$ is determined by event $e_i$ as above, then $e_i$ is the prompting event
of $m_{ij}$. By convention, a message may only be transmitted at the instant that its
prompting event has occurred. No contingent messages are permitted. For any
message over $(i, j)$, no time elapses between the internal event at $PP_i$ that initiates
the message and the corresponding message event at $PP_j$. A message transmission
of zero duration may seem counter-intuitive, but it is important to remember
that a message reflects the simultaneous change of state of interdependent physical
processes; it is not itself a physical message.


3.1.5  Predictability  The  dependency  relation  d  over  the  events  in  the  physical 
system  forms  an  irreflexive  partial  order  (irreflexive,  so  that  no  event  depends  upon 
itself).  Intuitively,  the  dependency  relation  d  reflects  the  order  in  which  events  must 
occur  in  the  system.  Pairs  of  events  for  which  d  is  not  defined  may  be  simulated 
concurrently,  or  in  arbitrary  order  [Mis86]. 


Suppose a cycle of PP's exists, such that for $PP_i$, $i = 1, \ldots, n$, message arcs
$(i, (i + 1) \bmod n)$, $i = 1, \ldots, n$ exist. If every event message $m_{ij}(t)$, $j = (i + 1) \bmod n$,
initiates an activity of duration 0, at the conclusion of which an event message
$m_{jk}(t)$, $k = (j + 1) \bmod n$ is sent, then there exists a circular definition such that
$d(m_{ij}(t), m_{jk}(t))$, $i = 1, \ldots, n$, $j = (i + 1) \bmod n$, $k = (j + 1) \bmod n$ exists,
and hence $d(m_{ij}(t), m_{ij}(t))$ exists (due to transitivity). As a result, each message
in the cycle depends upon itself, creating a situation of non-deterministic inputs to
each PP in the cycle.
The property of systems which precludes the previous situation is called
predictability. Predictability ensures that for each cycle in the system of PP's, there is
at  least  one  PPi  such  that  >  0,  0  <  f  <  Z.  Chandy  and  Misra  claim  that 

all  physical  systems  have  the  property  of  predictability  [CM79].  It  is  assumed  in 
this  paper  that  all  physical  systems  that  may  be  simulated  possess  the  property  of 
predictability. 
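
A check for predictability can be sketched as a search for cycles made up entirely of PP's that respond in zero time. The sketch assumes that the positive quantity required of at least one PP in each cycle can be modelled as a per-PP minimum delay; the `delay` mapping and the function name `is_predictable` are hypothetical stand-ins, not definitions from [CM79].

```python
def is_predictable(arcs, delay):
    """True if no cycle consists solely of PP's whose assumed delay is zero.

    arcs  : iterable of (i, j) message arcs
    delay : mapping from each PP to its assumed minimum delay (hypothetical)
    """
    zero = {p for p, d in delay.items() if d <= 0}
    # Only arcs between zero-delay PP's can form an offending cycle.
    succ = {p: [j for (i, j) in arcs if i == p and j in zero] for p in zero}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {p: WHITE for p in zero}

    def has_cycle(p):
        # Depth-first search; reaching a GREY node means a back edge, i.e. a cycle.
        colour[p] = GREY
        for q in succ[p]:
            if colour[q] == GREY or (colour[q] == WHITE and has_cycle(q)):
                return True
        colour[p] = BLACK
        return False

    return not any(colour[p] == WHITE and has_cycle(p) for p in zero)

ring = [(1, 2), (2, 3), (3, 1)]
all_instant = is_predictable(ring, {1: 0, 2: 0, 3: 0})
one_delayed = is_predictable(ring, {1: 0, 2: 5, 3: 0})
```

Restricting the search to the zero-delay PP's works because a cycle violates the property only when every PP on it has zero delay.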

3.2  The  Logical  System 

A system of $N$ PP's can be simulated by constructing a system of $M$ Logical
Processes (or LP's), with $M \le N$, where each LP simulates a disjoint set of one
or more PP's, called a sub-model, and each sub-model contains at least one PP.
$LP_i$ simulates the events of its sub-model $S_i$ in chronological order using an event
list data structure and associated operations, similar to those used in sequential
simulation, to ensure the correct simulation order of events over the simulation time
interval $[0, Z]$.

Message events transmitted between physical processes in different sub-models
are simulated by messages sent between LP's. $LP_i$ and $LP_j$ communicate in the
logical system if and only if some $PP_m \in S_i$ communicates with some $PP_n \in S_j$ in
the physical system. In this case a message channel $[i, j]$ exists between $LP_i$ and
$LP_j$.
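
The rule for deriving message channels from a partition of the PP's into sub-models can be sketched directly. The function name `logical_channels`, the `submodel_of` mapping, and the LP labels `"A"` and `"B"` are assumptions introduced for this illustration.

```python
def logical_channels(arcs, submodel_of):
    """Channels between distinct LP's implied by the PP message arcs.

    arcs        : iterable of (m, n) arcs between physical processes
    submodel_of : mapping from each PP to the LP that simulates it
    """
    return {
        (submodel_of[m], submodel_of[n])
        for (m, n) in arcs
        if submodel_of[m] != submodel_of[n]  # arcs inside a sub-model need no channel
    }

# Four PP's in a ring, split across two hypothetical LP's.
arcs = [(1, 2), (2, 3), (3, 4), (4, 1)]
submodel_of = {1: "A", 2: "A", 3: "B", 4: "B"}
channels = logical_channels(arcs, submodel_of)
```

Arcs whose endpoints fall in the same sub-model are handled inside that LP by its event list, so only the two cross-partition arcs produce channels here.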


3.2.1 The Logical Process The state of sub-model $S_i$ as simulated within $LP_i$
is described by the set of variables $STATE_i$, which is $\bigcup \{\Psi_n \mid \forall PP_n \in S_i\}$, whose
structure is simulation dependent and not relevant to the operation of the distributed
algorithm. The value of any $\Psi_n$ within $STATE_i$ is only changeable by the operation
$EVENT_i$, declared as $EVENT_i$(event, event_list, state) -> state, event_list.
$EVENT_i$ encapsulates the external and internal state transition and message-generating
functions of every PP within $S_i$. The effect of $EVENT_i(STATE_i, e)$ on the value
of $STATE_i$ for any event $e$ is assumed to properly simulate the effect of event $e$
occurring at its physical process.

The  event  notice  is  the  primary  data  item  of  interest  in  discrete-event  simula¬ 
tion.  An  event  notice  contains  information  that  determines  a  possible  change  in  the 
state  of  the  simulation. 

A  data  item  of  type  event_notice  can  be  described  as  follows: 

type event_notice is record of
    t : time;        -- simulation time of event
    e : event_data;
end record;


where, for event E, E.t is the invariant simulation time of the event, and E.e, of type
event_data, includes the information which determines the manner in which the state
of the simulation will change when some state transition function, also determined
by the event_data e, is applied.
3.2.1.1 The LP Event List Event messages sent to any PP within $S_i$ are
placed, in order of $t$ value, in the LP's event list, $EVNTLIST_i$. Within the upper
bound on $T_i$ set by the synchronization constraint $LOOK_i$, discussed below, the
simulation time at $LP_i$, $T_i$, is advanced to the time of the first event in $EVNTLIST_i$,
and the event is removed from the event list and simulated. The event list, then,
is a priority queue ordered by event time. Events may be ordered within time of
occurrence by some secondary ordering function. The simulation of each event may
cause future events to be scheduled on the event list, and the transmission of event
messages to other LP's. Because an event list orders events chronologically, events
may be simulated in proper order by simply removing the first event from the list
and simulating its effects. Contingent events may be pre-empted (cancelled) by
searching the event list sequentially for the event, which is removed from the list if
found.

The event_list is defined as follows:

structure event_list is
    INSERT(event, event_list) -> event_list;
    CANCEL(event, event_list) -> event_list;
    FINDNEXT(event_list) -> event;
    GETNEXT(event_list) -> event, event_list;
    LENGTH(event_list) -> integer;
end structure;


where INSERT(E,L) places event E in event_list L in increasing
      order of event time and returns the modified event_list L.
      CANCEL(E,L) searches event_list L to find event E, and deletes E from L
      if found.
      FINDNEXT(L) returns the value of the first event in
      event_list L, without modifying L.
      GETNEXT(L) removes the first event from event_list L and
      returns its value, and the modified event_list L.
      LENGTH(L) returns the integer value equal to the number
      of events in event_list L.


Methods of implementing the event_list structure are not elaborated upon here.
An ordered linear list is the most common “traditional” implementation, providing
event insertion and cancellation in O(n) time and next event retrieval in constant
time.  An  overview  of  some  newer  methods,  with  empirical  comparisons,  is  found  in 
[Jon86b].  A  discussion  of  the  performance  effects  resulting  from  the  choice  of  event 
list  implementation  is  found  in  Chapter  4. 
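
As an illustration of the ordered linear list, the five operations of the event_list structure might be sketched in Python as follows. This is one possible rendering, not the implementation evaluated in this thesis: the class names `EventNotice` and `EventList` and the method names are assumptions, and a Python list is used as a stand-in for a linked list.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass(order=True)
class EventNotice:
    t: float                        # simulation time of the event
    e: Any = field(compare=False)   # event data, excluded from ordering

class EventList:
    """Ordered linear list of event notices, earliest first."""

    def __init__(self) -> None:
        self._events: List[EventNotice] = []

    def insert(self, event: EventNotice) -> None:
        # O(n): keeps the list sorted; equal times stay in insertion order.
        bisect.insort(self._events, event)

    def cancel(self, event: EventNotice) -> bool:
        # O(n) sequential search for this exact notice; removes it if found.
        for idx, held in enumerate(self._events):
            if held is event:
                del self._events[idx]
                return True
        return False

    def find_next(self) -> Optional[EventNotice]:
        return self._events[0] if self._events else None

    def get_next(self) -> Optional[EventNotice]:
        # pop(0) is O(n) on a Python list; a linked list would make it O(1).
        return self._events.pop(0) if self._events else None

    def __len__(self) -> int:
        return len(self._events)

events = EventList()
late = EventNotice(5.0, "departure")
early = EventNotice(2.0, "arrival")
events.insert(late)
events.insert(early)
first = events.find_next()
```

Cancellation compares notices by identity rather than by time, so two distinct events scheduled for the same instant are never confused with one another.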

3.2.1.2 LP Time Advance Mechanism To ensure chronological ordering
of all events within $S_i$, event messages received at $LP_i$ from other logical processes
over every message channel $[k, i]$ need to be considered. It will be shown that it is
possible to calculate a lower bound on the simulation time of the next message that
will be received at $LP_i$ over any message channel $[k, i]$. This value, $LOOK_i$, can be
derived from information transmitted from each $LP_k$.

A variable T_ki for each message channel is declared, such that T_ki equals
E_ki.t, where E_ki is the last message to have been transmitted over [k,i]. T_ki is the
channel clock value for [k,i] [CM79]. Recall that the t values of messages transmitted
over arc (i,j) in the physical system are monotonic increasing and bounded by Z
[CM79]. Because LP_k in the logical system may send messages originating from
multiple PP’s in S_k to S_i at LP_i, the t values of messages sent over [k,i] in the logical
system are not necessarily monotonic increasing, but are guaranteed to be
non-decreasing, by the chronological ordering of EVNTLIST_k. Given this, if T_ki is the
t value of the last tuple transmitted over channel [k,i] at any point during the
simulation, then all messages from LP_k to LP_i have been sent up to time T_ki. LP_i
“knows” all messages it has received up to time LOOK_i, where
LOOK_i = minimum{T_ki, ∀[k,i]}. LP_i can then simulate events and calculate output
event messages up to and including LOOK_i.
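
The time advance bound can be illustrated with a minimal Python sketch. The dictionary representation of the channel clocks and the treatment of an LP with no input channels are assumptions made for illustration only; they are not part of the algorithm as specified.

```python
import math

def look(channel_clocks):
    """Return LOOK_i, the minimum channel clock T_ki over all input channels [k,i].

    channel_clocks maps a sender id k to T_ki (hypothetical representation).
    An LP with no input channels is treated as unconstrained (assumption).
    """
    return min(channel_clocks.values(), default=math.inf)

# Hypothetical clocks for an LP with input channels from LP_0 and LP_2.
clocks = {0: 6, 2: 3}
safe_until = look(clocks)  # events with t <= safe_until may be simulated
```

Here the channel from LP_2 is the constraining one: no event later than its clock value can be simulated safely, even though the channel from LP_0 is further ahead.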

It would be ideal to delay the simulation of events occurring at time LOOK_i
until all possible event messages occurring at this time have been read by LP_i. Doing
so would ascertain the proper order of execution among simultaneous message events
arriving at LP_i. However, constraining the advance of LOOK_i in this way may lead
to deadlock of the physical system in certain configurations of LP’s, as shall be
described. As a result, the order of simulation of simultaneous event messages that
arrive at an LP is determined by their order of arrival in the logical system, and
cannot be specified a priori. This property of the logical system may limit the
applicability of the distributed event list algorithm, unless a corrective modification
to the algorithm is found. None is apparent.


3.2.1.3 Predicting Message Transmissions. It is possible to calculate a
lower bound on the time of the next message to be transmitted over any outgoing
message channel [i,j] of LP_i. As described above, a trivial lower bound may always
be derived from the channel clock T_ij, due to the property of monotonicity. A tighter
bound may be computable from information in STATE_i, given knowledge of S_i. When
this prediction is possible, the lower bound on the time of the next message over any
[i,j], NEXTOUT_i, may be calculated and appended to each outgoing message tuple. By
computing LOOK_j = min{NEXTOUT_i | ∀[i,j]}, LP_j may then be able to advance
T_j sooner, since NEXTOUT_i > T_j holds.

In computing NEXTOUT_i, two possibilities must be considered: 1) Some
event that is presently pending in EVNTLIST_i may, when simulated, prompt a
message over [i,j]. Given the time-ordering of EVNTLIST_i, the time of the first
event in EVNTLIST_i provides a lower bound on the time of such an occurrence.
The value of this lower bound, MINEVNT_i, can be applied to every channel [i,j].
2) A message event yet to arrive at LP_i may prompt a message over some [i,j].
To calculate a lower bound MINABS_i on the time of this occurrence, information
about S_i is required. The earliest that an incoming message can arrive (if the logical
system is correct) is the current simulation time T_i. To compute MINABS_i, T_i is
added to the minimum elapsed simulation time between an incoming message event
M_ki over any [k,i] and the time of a resultant message M_ij over any [i,j], given no
other messages received or events pending. This value is denoted as MINADV_i.
MINADV_i is calculable by knowledge of the range of the function A_k for each PP_k
within S_i and the topology of communication of all PP’s within S_i. The value
of MINADV_i is invariant throughout the simulation, due to the invariance of the
physical system structure, and therefore need only be calculated once, a priori.


Calculation of NEXTOUT_i is accomplished by taking the minimum of MINABS_i
and MINEVNT_i. The algorithm for the calculation of NEXTOUT_i is then given
as Algorithm 3-A.

Algorithm 3-A:

{Lower bound on time of next message over any [i,j]}

declare MINABS_i  : time;  {bound on time of the next
                            message output over any [i,j] due to
                            an incoming message}
        MINADV_i  : time;  {bound on the time until the next
                            message output, given
                            message received at time T_i}
        MINEVNT_i : time;  {bound on time of next message
                            output by LP_i due to event on
                            the LP event list}

begin
    MINABS_i := T_i + MINADV_i;
    if LENGTH(EVNTLIST_i) = 0 then
        NEXTOUT_i := MINABS_i;
    else
        MINEVNT_i := FINDNEXT(EVNTLIST_i).t;
        NEXTOUT_i := min(MINABS_i, MINEVNT_i);
    endif;
end;
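
Algorithm 3-A may be sketched in Python as follows. The heap of (time, payload) tuples standing in for EVNTLIST_i, and the function and parameter names, are assumptions chosen for illustration; any time-ordered event list would serve.

```python
import heapq

def next_out(t_i, min_adv, event_list):
    """Lower bound on the time of the next message over any [i,j] (Algorithm 3-A).

    event_list is a heap of (time, payload) tuples, an assumed representation
    of EVNTLIST_i; event_list[0] plays the role of FINDNEXT.
    """
    min_abs = t_i + min_adv           # earliest output caused by a future arrival
    if not event_list:
        return min_abs
    min_evnt = event_list[0][0]       # time of the first pending event
    return min(min_abs, min_evnt)

# Hypothetical LP state: T_i = 3, MINADV_i = 1, pending events at times 5 and 8.
events = []
heapq.heappush(events, (8, "ev2"))
heapq.heappush(events, (5, "m01"))
bound = next_out(t_i=3, min_adv=1, event_list=events)
```

In this state the bound is set by MINABS_i rather than by the event list, because a message arriving at the current time could prompt an output before the first pending event.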


3.2.1.4 Logical System Predictability. The property of predictability in
physical systems has been discussed previously. Were it not for predictability,
systems of PP’s connected in a cyclic topology could not be simulated [Bry79, CM79],
and it was assumed in Section 3.1.5 that all simulated systems are predictable.
Unfortunately, the physical system’s predictability does not guarantee the predictability
of the simulating logical system in the distributed event list algorithm, since a
non-cyclic set of PP’s may be mapped into a cyclic set of sub-models (LP’s) as depicted
in Figure 3.1. In cases such as this, it is necessary to ensure that each cycle of LP’s
is predictable. This can be accomplished by placing constraints on the mapping of
PP’s onto LP’s.

Figure 3.1. Non-Cyclic System of PP’s with Cyclic Mapping

Previously, the only constraint on assignment of PP’s to LP’s has been that
there be at least one PP simulated per LP. To ensure predictability of the logical
system, the following condition is required:

Logical System Predictability Condition: Suppose a cycle of LP’s exists, such
that for LP_i, i = 1, ..., n, message channels [i, (i+1) mod n], i = 1, ..., n exist. For
every such cycle in the logical system, there must exist at least one LP_j, such that

MINADV_j > 0.
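
One way to test a proposed mapping against this condition is sketched below in Python. The sketch rests on the observation that every cycle contains an LP with MINADV > 0 exactly when the LP’s with MINADV = 0 form no cycle among themselves. The graph representation, the function name, and the assumption that MINADV values are non-negative are illustrative only.

```python
def is_predictable(channels, min_adv):
    """True if every directed cycle of LPs contains an LP with MINADV > 0.

    channels: dict mapping an LP id to the ids it sends to (channels [i,j]).
    min_adv:  dict mapping every LP id to its MINADV value (assumed >= 0).
    """
    zero = {lp for lp, adv in min_adv.items() if adv == 0}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {lp: WHITE for lp in zero}

    def has_cycle(lp):
        # Depth-first search restricted to the zero-MINADV subgraph.
        colour[lp] = GREY
        for nxt in channels.get(lp, ()):
            if nxt not in zero:
                continue              # edge leaves the zero-MINADV subgraph
            if colour[nxt] == GREY:
                return True           # back edge: a cycle with no time advance
            if colour[nxt] == WHITE and has_cycle(nxt):
                return True
        colour[lp] = BLACK
        return False

    return not any(colour[lp] == WHITE and has_cycle(lp) for lp in zero)

# Hypothetical logical system: LP 1 and LP 2 form a cycle, LP 3 is downstream.
channels = {1: [2], 2: [1, 3], 3: []}
ok = is_predictable(channels, {1: 0, 2: 1, 3: 0})
```

A check of this kind could be applied once, before the simulation begins, since MINADV is invariant for a given mapping.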

3.2.2 Logical Process I/O. The message_tuple is the data item used to
transmit event messages and the value of NEXTOUT_i over [i,j] in the logical system,
and is defined as follows:

type message_tuple is record of
    m : message;
    next : time;  {lower bound on simulation time of next
                   message over channel}
end record;

where

type message is
    subtype event_notice;
    subtype no_event;
end type;

The subtype no_event contains only a single possible value, NULL, signifying
that no event is being transmitted in the message_tuple.

The utility of a no_event message will become apparent in the following
sections. The value of M.next for a message_tuple M sent over channel [k,i] is NEXTOUT_k.

3.2.2.1 Message Input. A message is read into LP_i from an incoming
message channel [k,i] by executing the function READ_i(k) at LP_i. The boolean
function MESSPENDING_i(k) returns TRUE if a message on channel [k,i] is
available to be read by LP_i, and FALSE otherwise. An event message read into LP_i is
inserted immediately, in proper chronological order, into EVNTLIST_i (NULL
messages, to be discussed later, are read and discarded).

If LP_i is allowed to read all event messages pending from each incoming channel
at any point in the simulation, the number of messages read into LP_i at that point
cannot be predicted, because of the asynchronous operation of the LP’s. The length of
EVNTLIST_i may then grow without any known upper bound, the only upper bound
on the number of incoming messages being the total number of messages received
by LP_i during the entire simulation. To ensure that the amount of memory needed
to store incoming event messages in EVNTLIST_i is known a priori, a mechanism
for controlling the flow of messages into each LP is required.

The goal is to constrain the reading of messages into LP_i to a known level
while maintaining the advance of simulation time at LP_i. This is accomplished by
permitting LP_i to read messages only from the set of input channels [k,i] such that
NEXT_ki = LOOK_i. Recall that LOOK_i = min{NEXT_ki | ∀k}, so that
{[k,i] | NEXT_ki = LOOK_i} is the set of channels whose NEXT values are constraining
the advance of LOOK_i, and hence the simulation time T_i. LP_i reads pending messages,
if any exist, from each channel in this set until NEXT_ki > LOOK_i. This algorithm is
shown below in Algorithm 3-B, which is iterated until LOOK_i is advanced. The number of
messages read from a single channel is bounded by the number of messages that can
possibly be sent to an LP at any single instant in simulation time, plus one. This
value can be calculated from the interconnection structure of the physical system.

Algorithm 3-B:

declare for all [k,i],
    NEXT_ki : time;         {lower bound on time of next message on channel [k,i]}
    M_ki : message_tuple;   {message retrieved from channel [k,i]}
endfor;

begin
    for all [k,i],
        while MESSPENDING_i(k) and NEXT_ki = LOOK_i loop
            M_ki := READ_i(k);
            NEXT_ki := M_ki.next;
            if M_ki.m ≠ NULL then
                INSERT(M_ki.m, EVNTLIST_i);
            endif;
        end loop;
    endfor;
end;
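
A single pass of Algorithm 3-B may be sketched in Python as follows. The deque per input channel, the heap used for EVNTLIST_i, and the use of None for the NULL value are assumed representations. Returning the minimum NEXT value is a convenience added here; in the thesis that step belongs to the Read Phase of the LP.

```python
import heapq
from collections import deque

NULL = None  # stand-in for the single no_event value

def read_phase_step(inbox, next_in, look, event_list):
    """One pass of Algorithm 3-B over all input channels of one LP.

    inbox:      dict k -> deque of (m, next) tuples; m is NULL or (t, payload)
    next_in:    dict k -> NEXT_ki, updated in place
    look:       current LOOK_i
    event_list: heap of (t, payload) tuples, updated in place
    Returns min NEXT_ki, the candidate new LOOK_i (added for convenience).
    """
    for k, queue in inbox.items():
        # Only channels that constrain LOOK_i are read.
        while queue and next_in[k] == look:
            m, nxt = queue.popleft()           # READ_i(k)
            next_in[k] = nxt
            if m is not NULL:
                heapq.heappush(event_list, m)  # INSERT in time order
    return min(next_in.values())

# Hypothetical state: channel [2,i] constrains LOOK_i = 1, channel [0,i] does not.
inbox = {0: deque([((7, "m01"), 8)]), 2: deque([((2, "m21"), 3)])}
next_in = {0: 6, 2: 1}
events = []
new_look = read_phase_step(inbox, next_in, look=1, event_list=events)
```

The message waiting on channel [0,i] is left unread, which is how the bound on the length of the event list is maintained.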


3.2.2.2 Message Output. A message M is sent from LP_i over [i,j] by
performing the procedure SEND_i(M, j). Event messages from LP_i may be
generated when an event is simulated at LP_i. Messages are created and sent within the
EVENT_i procedure. Note that event messages transmitted from LP_i must have
a time value t equal to that of the event that prompted the message; contingent
messages are not allowed. This is in accordance with the basic Chandy-Misra
algorithm and the physical system paradigm described above [CM79]. Operation
SEND_i(M, j) places the message M in a message tuple for transmission on
channel [i,j], consisting of the message itself and the value of NEXTOUT_i, computed
using Algorithm 3-A above.




3.2.3 Properties of LP Communication. A communications protocol for
message-passing is not explicitly specified in [CM79], but some assumptions are made
concerning message transmission in the logical system. Messages are assumed to be
transmitted correctly, and messages between any two LP’s are assumed to be received in
the order transmitted, within a finite time period. The algorithms discussed in this
paper also operate under these basic assumptions [CM79].

The above assumptions imply some requirements for the communications
protocol. In any implementation of the logical system, there is some finite amount of
memory available to buffer incoming messages at each LP. To ensure the correctness
of message transmissions, then, the message communication protocol must ensure
that a message over channel [i,j] will not be sent until LP_j has sufficient buffer
space available to accommodate it. In that case, LP_i may queue outgoing messages
into an output buffer of some finite size. If LP_i attempts to send a message to LP_j
whose receive buffer for [i,j] is full, and LP_i’s send buffer for [i,j] is full, LP_i
cannot proceed and is blocked. The ramifications of this blocking send protocol will be
discussed in the following section on the resolution of process deadlock.

3.2.4 Summary of Algorithm. The basic distributed event list algorithm at
each LP consists primarily of a sequence of two phases, executed iteratively until
simulation termination conditions have been reached:


Algorithm 3-C:

declare T_i : integer;                  {simulation time at LP i}
        for all [k,i], NEXT_ki : time;  {channel clock}
        LOOK_i : time;                  {moving upper bound on time advance}
        NEV : event_notice;             {next event to be simulated at LP i}
        EVNTLIST_i : event_list;        {event list for LP i}
        STATE_i : state;                {set of state variables for S_i}

begin
    {Initialize}
    T_i, LOOK_i := 0; NEV := (Z, NULL);
    for all [k,i], NEXT_ki := 0; endfor;
    {Initialize STATE_i; schedule initial events, if any}

    loop until {simulation termination condition reached}

        {Read Phase: Attempt to read messages to advance LOOK_i}
        until LOOK_i > T_i loop
            Read messages on input channels [k,i]; {Algorithm 3-B}
            LOOK_i := minimum(NEXT_ki, ∀[k,i]);
        end loop;

        {Event Phase: Simulate all pending events up to LOOK_i}
        while LENGTH(EVNTLIST_i) > 0 and NEV.t <= LOOK_i loop
            NEV := GETNEXT(EVNTLIST_i); {Get next event from list}
            T_i := NEV.t;
            {Simulate next event}
            EVENT_i(NEV, EVNTLIST_i, STATE_i);
        end loop;

        {if no events scheduled in interval T_i to LOOK_i, advance T_i}
        if LOOK_i > T_i then
            T_i := LOOK_i;
        endif;

    end loop;
end;
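
The alternation of Read Phase and Event Phase may be illustrated with the following Python sketch of a single LP fed from pre-filled input channels. Several simplifications are assumptions of the sketch and not features of Algorithm 3-C: the LP returns where a real LP would block on a read, event simulation is reduced to recording the event, no output messages are generated, and the Event Phase tests the head of the event list instead of NEV.t.

```python
import heapq

def run_lp(inbox, initial_events=()):
    """Alternate Read and Event Phases for one LP until its input runs dry.

    inbox: dict k -> list of (m, next) tuples in arrival order; m is None
           (Null) or a (t, payload) event. At least one channel is required.
    Returns (trace of simulated events, final T_i).
    """
    pending = {k: list(msgs) for k, msgs in inbox.items()}
    next_in = {k: 0 for k in inbox}
    events = list(initial_events)
    heapq.heapify(events)
    t_i = look = 0
    trace = []
    while True:
        # Read Phase: read constraining channels until LOOK_i exceeds T_i.
        while look <= t_i:
            progressed = False
            for k, msgs in pending.items():
                while msgs and next_in[k] == look:
                    m, next_in[k] = msgs.pop(0)
                    progressed = True
                    if m is not None:
                        heapq.heappush(events, m)
            look = min(next_in.values())
            if not progressed and look <= t_i:
                return trace, t_i        # a real LP would block here
        # Event Phase: simulate every pending event with t <= LOOK_i.
        while events and events[0][0] <= look:
            t_i, payload = heapq.heappop(events)
            trace.append((t_i, payload))  # stand-in for EVENT_i(...)
        t_i = max(t_i, look)              # advance T_i to LOOK_i

# Hypothetical input resembling LP1 in Example 3-1.
trace, final_t = run_lp({0: [((5, "m01"), 6)], 2: [((2, "m21"), 3)]})
```

With this input the LP simulates the event at time 2, advances to time 3, and then stalls with the event at time 5 still pending, because the channel from LP 2 has not advanced past 3.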
3.2.5 Deadlock in the Logical System. The basic logical system as described
above is subject to the problem of logical process deadlock, as described in [CM81].
Deadlocks may occur within both cyclic and acyclic networks of LP’s. A method for
avoiding deadlock, the Null Message algorithm proposed by Chandy and Misra in
[CM79], has been modified for use within the distributed event list algorithm, and
is presented in this section.

3.2.5.1 Cyclic Deadlock Among LP’s. A cyclic deadlock may occur within
a set of LP’s {LP_k | k = 1, ..., K} such that message channel [k, (k+1) mod K],
k = 1, ..., K exists in the logical system. Such a deadlock occurs when each LP_k
in the set is attempting to read from channel [(k-1) mod K, k], and no message is
in transit over [(k-1) mod K, k], k = 1, ..., K [CM79]. In Example 3-1 below, the
set of LP’s is deadlocked, and will not advance in simulation time. LP’s outside of
the deadlocked set that depend on event messages from any LP in this set will then
eventually become blocked due to message “starvation,” and so will their dependent
LP’s, and so on, until the entire logical system is halted.

In Example 3-1, we show that in the system in Figure 3.2 (b), the set {LP_1, LP_2}
is deadlocked, and will not progress in simulation time. A message in the logical
system is denoted by (t,m,next).

Example 3-1 In Figure 3.2 (a), LP1 has previously sent event message (1,m12,2) to LP2.
LP2 read this message, inserted (1,m12) into EVNTLIST2, advanced LOOK2 to 2, and then ended
its Read Phase. LP2 simulated (1,m12), causing an internal event (2,ev2), which prompted message
(2,m21). (2,m21,3) was then transmitted to LP1. At this point the next internal event scheduled
at LP2 was (8,ev2). LP2, with LOOK2 = T2 = 2, could simulate no further, and so entered a
Read Phase. LP1 has read message (5,m01,6) from [0,1] in a previous Read Phase, which caused
(5,m01) to be placed as the next event in EVNTLIST1. LP1 could not advance its clock to 5, being
constrained by NEXT21 = 1. LP1 then entered another Read Phase, attempting to read from
[2,1]. (2,m21,3) arrived, and (2,m21) was read into EVNTLIST1 as the new next event. LOOK1
was advanced to 3, and LP1 left its Read Phase. This is the status of the set of LP’s as shown in
Figure 3.2 (a).

LP1 now enters its Event Phase, and with LOOK1 = 3, is able to remove (2,m21) from
EVNTLIST1 and simulate it. As a result, an internal event is scheduled at time 3, which is then
also simulated. The simulation of this internal event prompts event (3,m12), which is sent over [1,2]
in message tuple (3,m12,4). With its next scheduled event at time 5, LP1 can simulate no more,
and so enters its next Read Phase. LP2, in a Read Phase as we recall, reads (3,m12,4), advances
LOOK2 to 4, and exits its Read Phase. LP2 then simulates (3,m12), causing an internal event at
time 4, which prompts an event message (4,m23,5) to LP3 (outside of the cycle). With its next
internal event at time 8, LP2 cannot simulate further, and enters a Read Phase once again.

The cycle is now in a deadlocked state, as shown in Figure 3.2 (b). LP1 will continue to
attempt to read messages from [2,1] and no other channel, since NEXT21 = 3 is constraining
LOOK1 and NEXT01 = 6 is not. LP2 is in a Read Phase, attempting to read messages over [1,2]
until LOOK2 is advanced. LP2 will not leave its Read Phase until it receives a message over [1,2],
and LP1, as we recall, cannot send any messages until it reads a message over [2,1].

In the above example, each LP in the cycle is waiting to receive a message
exclusively from the previous LP. No message from outside the cycle of LP’s is
sufficient to change this condition. Therefore, the deadlock condition is permanent.
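
The circular wait in Example 3-1 can be made explicit with a small diagnostic sketch in Python. This is not part of the distributed event list algorithm; it is a hypothetical global check, with assumed names and data layout, that builds a wait-for relation from the NEXT values of LP’s currently in a Read Phase and searches it for a cycle.

```python
def deadlocked_cycle(next_in, in_transit):
    """Return a list of LPs forming a circular wait, or None.

    next_in:    dict i -> {k: NEXT_ki} for every LP i that is in a Read Phase
    in_transit: set of (k, i) channels that currently hold an unread message
    """
    waits_on = {}
    for i, clocks in next_in.items():
        look = min(clocks.values())
        # LP_i reads only channels whose NEXT value equals LOOK_i.
        waits_on[i] = [k for k, nxt in clocks.items()
                       if nxt == look and (k, i) not in in_transit]

    def visit(lp, path):
        if lp in path:
            return path[path.index(lp):]   # closed a circular wait
        for k in waits_on.get(lp, ()):
            found = visit(k, path + [lp])
            if found:
                return found
        return None

    for start in waits_on:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

# State at the end of Example 3-1: NEXT01 = 6, NEXT21 = 3, NEXT12 = 4.
cycle = deadlocked_cycle({1: {0: 6, 2: 3}, 2: {1: 4}}, in_transit=set())
```

LP 0 does not appear in the wait-for relation because LP 1 is not reading from [0,1]; only the channel whose NEXT value equals LOOK1 is being read.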


3.2.5.2 Acyclic Deadlock Among LP’s. Deadlock may also occur in acyclic
networks of LP’s under certain conditions. A “blocking send” implementation
with bounded buffers as described in Section 3.2.3 can result in acyclic deadlock
when a set of LP’s are connected in a “fork-join” configuration, as in Figure 3.3
below.


In any implementation of the logical system, each LP has a finite buffer size for
sending and receiving messages on each channel over which it communicates. Buffer
sizes are purely an implementation constraint of the logical system. Acyclic deadlock
is a possibility when an LP may receive messages over more than one channel, but
does not receive messages over all channels at equivalent rates. If LP A is waiting
to read over a subset of its input channels, the other channel buffers may become
filled, causing a sending LP B to become blocked. If another LP C sends sufficiently
many messages to B, it will eventually become blocked, because LP B will not read,
being blocked. If the channels on which LP A is waiting are dependent on messages
sent by LP C, then LP A will never receive messages over that channel, since LP C
is blocked, and LP A will continue to attempt to read [CM79].

In  Example  3-2  below,  we  show  that  the  system  in  Figure  3.3  (b)  is  deadlocked, 
and  will  not  progress  in  simulation  time.  We  assume  that  each  message  channel  in 
the  logical  system  has  buffer  size  sufficient  to  hold  a  single  message. 


Example 3-2 In Figure 3.3 (a), the set of {LPj, j = 1, ..., 4} is not yet in a deadlocked
state. We observe that LP1 had previously sent message (5,m12,6) and has received message
(6,m01,7) from LP0. LP2 had previously sent (4,m23,5) to LP3. In a subsequent Read Phase, LP2
read message (5,m12,6), placed (5,m12) in EVNTLIST2, advanced LOOK2 to 6, and then ended
its Read Phase. LP3, in a Read Phase with NEXT13 = NEXT23 = 3, was attempting to read over
both [1,3] and [2,3]. LP3 read (4,m23,5), placing (4,m23) in EVNTLIST3. But because NEXT13
= 3, LOOK3 remained at 3, and LP3 remained in a Read Phase, now only attempting to read from
[1,3], in order to advance LOOK3. This is the status of the set of LP’s shown in Figure 3.3 (a).

LP2 has ended its Read Phase, and now enters the subsequent Event Phase. LOOK2 = 6,
and so LP2 simulates (5,m12). As a result, an internal event is scheduled for time 6. This event is
still within the “safe” simulation period, and so is simulated. The internal event prompts a message
(6,m23), which is sent over [2,3] in message tuple (6,m23,7). With no more pending events before
or at time 6, LP2 then enters a Read Phase. LP3, as we recall, is reading, but only over [1,3],
so that message (6,m23,7) occupies buffer[2,3]. Meanwhile, LP1 has simulated (6,m01), causing
an internal event at time 7. With LOOK1 = 7, this event is simulated, which prompts message
(7,m12,8). LP1 then enters another Read Phase.

LP2, in its Read Phase, reads (7,m12,8). LOOK2 is advanced to 8, and LP2 begins an Event
Phase. Simulating (7,m12) causes internal event (8,ev2), which prompts message (8,m23). LP2
then attempts to send (8,m23,9), but is blocked, because message (6,m23,7) still occupies buffer[2,3].
LP2 will remain blocked until LP3 reads (6,m23,7). LP1, in a Read Phase, reads (8,m01,9). A
subsequent internal event at time 9 prompts message (9,m12,10) to LP2. But because it is blocked,
LP2 cannot read over [1,2], and (9,m12,10) occupies buffer[1,2].

LP1, able to simulate no further, enters a Read Phase. In this phase, LP1 reads (9,m01,10),
and consequently exits the Read Phase. Simulating (9,m01) causes an internal event at time 10.
This event is simulated, prompting message (10,m12,11), and a new next internal event at time 14.

Once LP1 tries to send message (10,m12,11) to LP2, the conditions required for deadlock
are complete, as shown in Figure 3.3 (b). LP1 attempts to send the message over channel [1,2],
but buffer[1,2] is full, so LP1 is blocked. Buffer[1,2] may only become unblocked if LP2 reads its
contents, message (9,m12,10). LP2 is blocked sending over [2,3], and will only become unblocked if
buffer[2,3] becomes free. For that to occur LP3 would have to read over [2,3]. LP3 will not read
over [2,3] unless it first reads over [1,3]. LP3 cannot read a message over [1,3] until LP1 sends
one. LP1 cannot send a message over [1,3] because LP1 is blocked; hence, the deadlock cycle is
complete.


No message from any LP outside of the deadlocked set can affect the condition
of the deadlocked set. The deadlock condition is then permanent and the set of LP’s
described above will not progress in simulation time, leading to non-termination of
the logical system.


3.2.5.3 Null Message Deadlock Avoidance. The method of Null
messages is used to avoid the process deadlock that is inherent to the basic system
[CM79]. A Null message is a message transmitted with the singular purpose of
advancing the channel clock of the channel on which it is sent, thus possibly allowing
the recipient LP to advance its simulation clock. The contents of the message, the
no_event symbol NULL, do not affect the state of simulation at the sending and
receiving LP’s.

Figure 3.3. Set of Logical Processes Subject to Acyclic Deadlock
((b) Set of LP’s in Deadlock State)

The following condition is sufficient to prevent cyclic deadlock in the logical
system:

Null Message Condition 1: For every LP_i, i = 1, ..., N; once LP_i exits
a Read Phase, LP_i sends at least one message, either an event message or a Null
message, over every output channel before entering another Read Phase.

The  validity  of  the  assertion  that  this  condition  avoids  deadlock  can  be  seen 
intuitively  from  Example  3-1  above.  It  can  be  proven  inductively  for  any  valid  cyclic 
set  of  LP’s  by  showing  that  it  avoids  the  circular  read  condition  of  cyclic  deadlock. 

The  following  condition  is  sufficient  to  prevent  acyclic  deadlock  in  the  logical 
system: 

Null Message Condition 2: For every LP_i, i = 1, ..., N; once LP_i sends an
event message (t, m, next) over message channel [i,j], then LP_i has sent a message,
either an event message or a Null message, such that next_im ≥ LAST_ij for all m,
over every other output channel [i,m] before attempting to send another message
over [i,j].

The validity of the assertion that this condition avoids deadlock can be seen
intuitively from Example 3-2 above. It can be proven inductively for any valid set of
LP’s by showing that it avoids the read-write-read condition of acyclic deadlock.

It is not efficient to send a Null message over [i,j] that has no effect on the value
of NEXTOUT_i at LP_j. We preclude this by maintaining at each LP_i a variable for
each outgoing message channel [i,j], LAST_ij, set to the value of NEXTOUT_i of the
message, Null or event, that was last transmitted over [i,j]. A Null message is only
transmitted over [i,j] if the computed value of NEXTOUT_i to be inserted into the
message is strictly greater than LAST_ij.

Note that the added constraint of LAST_ij on the sending of a Null message over
[i,j] does not violate the Null Message Conditions. In cases where NEXTOUT_i =
LAST_ij and the sending of a Null message is inhibited, then, by definition of LAST_ij,
a message M_ij has been sent over [i,j], such that M_ij.next = NEXTOUT_i. By
Algorithm 3-A, either the value of T_i and the time of the next event in EVNTLIST_i,
MINEVNT_i, are at present identical to those at the time M_ij was sent over [i,j], or
the value of MINEVNT_i is the same and equal to LOOK_i at the time M_ij was sent.
If a Read Phase had occurred in the intervening time since the last message send,
LOOK_i would have advanced (Algorithm 3-C), precluding the latter condition. The
value of T_i would have increased with the value of LOOK_i, due to the sending of M_ij
(Algorithm 3-C), precluding the former condition. No Read Phase has then occurred
since the time of the last message over [i,j]. A Null message is not then necessary,
and Null Message Condition 1 is preserved. Null Message Condition 2 is preserved
because a Null message is withheld only if some message has been sent over [i,j]
with next message component next_ij = NEXTOUT_i. NEXTOUT_i ≥ LAST_im for
any [i,m] due to monotonicity of T_i. Then, if a Null message is withheld over [i,j],
it is guaranteed that LAST_ij ≥ LAST_im for all m, and Null Message Condition 2 is
preserved.


We can now describe an algorithm for procedure SENDNULL_i(j) to send a
Null message over output channel [i,j] such that the Null Message Conditions are
fulfilled:

Algorithm 3-D:

begin
    Compute NEXTOUT_i; {Algorithm 3-A}
    if NEXTOUT_i > LAST_ij then
        SEND_i((NULL, NEXTOUT_i), j);
        LAST_ij := NEXTOUT_i;
    endif;
end;
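
Algorithm 3-D may be sketched in Python as follows. The dictionary holding LAST_ij per output channel and the send callback are assumptions made for illustration; the value of NEXTOUT_i is taken as already computed by Algorithm 3-A.

```python
def send_null(j, next_out, last, send):
    """Send a Null over [i,j] only if it advances the channel clock (Algorithm 3-D).

    last: dict j -> LAST_ij, updated in place
    send: callable taking (message_tuple, j); a hypothetical stand-in for SEND_i
    Returns True if a Null message was transmitted.
    """
    if next_out > last[j]:
        send((None, next_out), j)   # the tuple (NULL, NEXTOUT_i)
        last[j] = next_out
        return True
    return False

# Hypothetical state: NEXTOUT_i = 5, with LAST values 4 and 5 on two channels.
sent = []
last = {2: 4, 3: 5}
record = lambda msg, j: sent.append((j, msg))
first = send_null(2, 5, last, record)    # advances [i,2], so it is transmitted
second = send_null(3, 5, last, record)   # no advance on [i,3], so it is withheld
```

Only the first call produces a message, since the second would tell its recipient nothing it does not already know.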


The  Null  Message  Conditions  can  be  implemented  at  LPi  with  the  following 
straightforward  algorithm: 

Algorithm 3-E (1):

begin
    {Initialize}
    T_i, LOOK_i := 0; NEV := (Z, NULL);
    for all [k,i], NEXT_ki := 0; endfor;
    {Initialize STATE_i; schedule initial events, if any}

    loop until {simulation termination condition reached}

        for all [i,j] which did not send messages
                during previous Event Phase,
            SENDNULL_i(j); {Algorithm 3-D}
        endfor;

        Perform Read Phase;
        {Attempt to read messages until able to advance LOOK_i}

        Perform Event Phase;
        {Simulate pending events up to LOOK_i}

        {if no events scheduled in interval T_i to LOOK_i, advance T_i}
        if LOOK_i > T_i then
            T_i := LOOK_i;
        endif;

    end loop;
end;


It is possible to take advantage of some of the properties of Null messages to
modify the Read Phase of the LP for greater efficiency. Unlike an event message,
a Null message received at LP_i over [k,i] has no message content to store after
its message time has been used to update NEXT_ki. It is therefore possible to
read any number of Null messages over an input channel without exceeding storage
bounds. Because the time associated with each successive Null message is guaranteed
to be increasing (by Algorithm 3-D), only the last Null message in a sequence of
Null messages in an input buffer is of interest to us. The situation may arise in
which sequences of Null messages in an input buffer cause many small increments of
simulation time advance. Since each time advance at LP_i leads to a Null message sent
over every channel [i,j], this “thrashing” from Read Phase to Event Phase intensifies
the thrashing effect in recipient LP’s, increasingly fragmenting the simulation time
advance and clogging system buffers with superfluous Null messages. Thrashing is
alleviated by modifying Algorithm 3-B so that a Null message received over [k,i]
will not cause reading over [k,i] to end unless no more messages are pending over
[k,i].
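
The modification to the reading of a single channel may be sketched in Python as follows. The deque of (m, next) tuples, the heap used for the event list, and the use of None for a Null message are assumed representations, and the function name is hypothetical.

```python
import heapq
from collections import deque

def read_channel(queue, next_ki, look, event_list):
    """Read one input channel, reading through runs of Null messages.

    queue: deque of (m, next) tuples; m is None for a Null message.
    Reading continues while the channel constrains LOOK_i, or while the last
    message read was a Null and more messages are pending.
    Returns the updated NEXT_ki.
    """
    last_was_null = False
    while queue and (next_ki == look or last_was_null):
        m, next_ki = queue.popleft()
        last_was_null = m is None
        if not last_was_null:
            heapq.heappush(event_list, m)   # only event messages are stored
    return next_ki

# Hypothetical buffer holding three Nulls followed by one event message.
buf = deque([(None, 1), (None, 2), (None, 3), ((3, "m"), 4)])
events = []
new_next = read_channel(buf, next_ki=0, look=0, event_list=events)
```

Under the unmodified Algorithm 3-B the first Null alone would end the read with NEXT_ki = 1, prompting a small time advance and a fresh round of Null messages; here the whole run is absorbed in a single Read Phase.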
3.2.6 Null Message Variants. Null messages are overhead in the logical
system, contributing nothing to the actual execution of the simulation model. In order
to maximize throughput, then, it might seem wise to restrict the conditions for the
transmission of Null messages to the minimum required to avoid deadlock. This, it
shall be seen, is often not the case.

Null messages sent over [i,j] perform their function of deadlock avoidance by
incrementing the value of channel clock NEXTOUT_i at LP_j. If NEXTOUT_i =
LOOK_j before a Null message is sent over [i,j], then the transmission of the Null
message may allow LP_j to advance LOOK_j sooner than it ordinarily could. This
may be particularly beneficial if a large percentage of events in the logical system
are internal to some LP, with infrequent messages between LP’s.

The following algorithms are variants on the basic algorithm that change the
conditions under which Null messages are sent, for the purpose of increasing the
throughput of the logical system. A performance analysis and empirical comparison
of these methods versus the “basic” method described above can be found in
Chapter 4.


3.2.6.1 Null Message Algorithm with Stimulus Nulls  Prof. B. Donlan at
the Air Force Institute of Technology has devised a method of sending additional
Null messages purely to improve speed-up. The extraneous Null messages, known
as stimulus Nulls for their effect on throughput, are transmitted after the execution
of a certain number of events, specified as a ratio of events to stimulus Nulls.

Algorithm 3-E (2):

declare EVNTCNT : integer;    {Number of events since last transmission of stimulus Nulls}
        NULLRATIO : integer;  {Number of events executed before stimulus Nulls are sent}

begin

    {Initialize}
    EVNTCNT := 0;

    loop until {simulation termination condition reached}

        {Regular Null send required to fulfill Null Message Condition}
        for all [i, j]
            which did not send messages during previous Event Phase,
            SENDNULLi(j);    {Algorithm 3-D}
        endfor;

        Perform Read Phase;
        {Attempt to read messages until able to advance LOOKi}

        {Event Phase: Simulate pending events up to LOOKi}
        while LENGTH(EVNTLISTi) > 0 and NEV.t < LOOKi loop
            NEV := GETNEXT(EVNTLISTi);    {Get next event from list}
            Ti := NEV.t;

            {Simulate next event}
            EVENTi(NEV, EVNTLISTi, STATEi);

            {Send Stimulus Nulls}
            EVNTCNT := EVNTCNT + 1;
            if EVNTCNT >= NULLRATIO then
                for all [i, j],
                    SENDNULLi(j);
                endfor;
                EVNTCNT := 0;
            endif;
        end loop;

        {If no events scheduled in interval Ti to LOOKi, advance Ti}
        if LOOKi > Ti then
            Ti := LOOKi;
        endif;

    end loop;

end;
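
The stimulus-Null bookkeeping of Algorithm 3-E (2) can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the thesis
implementation: the names event_phase and send_null, the dictionary holding the
counters, and the use of a heap of bare event times in place of EVNTLISTi are all
hypothetical, and the time stamp passed to send_null is a placeholder for whatever
Algorithm 3-D would compute.

```python
import heapq

def event_phase(event_list, look, out_channels, send_null, state):
    """Simulate pending events with time below `look`, sending stimulus Nulls.

    event_list   -- heap of event times (hypothetical stand-in for EVNTLIST_i)
    look         -- current value of LOOK_i
    out_channels -- identifiers j of the output channels [i, j]
    send_null    -- callback standing in for SENDNULL_i(j) (hypothetical)
    state        -- dict holding EVNTCNT, NULLRATIO and the local clock T
    """
    while event_list and event_list[0] < look:
        state["T"] = heapq.heappop(event_list)    # T_i := NEV.t
        # ... the event itself would be simulated here ...
        state["EVNTCNT"] += 1
        if state["EVNTCNT"] >= state["NULLRATIO"]:
            for j in out_channels:                # one stimulus Null per channel
                send_null(j, state["T"])
            state["EVNTCNT"] = 0
    if look > state["T"]:                         # no events left before LOOK_i
        state["T"] = look

sent = []
state = {"EVNTCNT": 0, "NULLRATIO": 2, "T": 0.0}
events = [1.0, 2.0, 3.0, 4.0, 9.0]
heapq.heapify(events)
event_phase(events, 5.0, [7, 8], lambda j, t: sent.append((j, t)), state)
```

With a ratio of 2, stimulus Nulls go out on both channels after the second and
fourth events, and the local clock is then advanced to the look-ahead bound.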


Preliminary  studies  have  shown  that  stimulus  Nulls  can  provide  significantly 
improved  Speed-up  for  some  systems  of  LP’s.  As  the  stimulus  null  ratio  is  increased 
beyond  a  certain  point,  the  incremental  increases  in  Speed-up  diminish  rapidly  as  the 
Null  message  overhead  takes  its  toll.  The  reasons  for  this  phenomenon  are  discussed 
fully  in  Chapter  4. 

3.2.6.2 Null Message Algorithm with Timed Nulls  Another variation
on the basic Null message scheme is that of transmitting Null messages only after
some amount of processing time has elapsed. This algorithm was proposed by Misra
in [Mis86], and can be incorporated into the distributed event list algorithm.

To employ this algorithm variant it is necessary to modify Null Message Condition 1.
Rather than specifying that a message be sent over every output channel
between each Read Phase, the following condition can be specified:

Null Message Condition 1a  For every LPi, i = 1, ..., N: once LPi begins
a Read Phase, LPi may send at least one message, either an event message or a Null
message, over every output channel [i, j] before entering another Read Phase, and
LPi will send a message over every output channel within a finite amount of
time.


This condition is fulfilled in the following algorithm for transmission of
time-driven Null messages:

Algorithm 3-E (3):

declare clocki : real_time;      {processor clock time at LPi}
        NULLTIMEi : real_time;   {specified elapsed time between Nulls over all [i, j]}
        TIMEOUTi : real_time;    {time-out value for Null send}

{Initialize}
TIMEOUTi := clocki + NULLTIMEi;

loop until {simulation termination condition reached}

    {Read Phase: Attempt to read messages to advance LOOKi}
    until LOOKi > Ti loop

        Read messages on input channels [k, i];    {Algorithm 3-B}
        LOOKi := MIN(NEXTki);

        {Send Nulls over each channel [i, j] if the time-out for Null send has expired}
        if clocki > TIMEOUTi then
            for all [i, j],
                SENDNULLi(j);
            endfor;
            TIMEOUTi := clocki + NULLTIMEi;
        endif;

    end loop;

    Perform Event Phase;    {Simulate pending events}

    {If no events scheduled in interval Ti to LOOKi, advance Ti}
    if LOOKi > Ti then
        Ti := LOOKi;
    endif;

end loop;
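
The time-out test inside the Read Phase of Algorithm 3-E (3) might be realized as in
the following Python sketch. The class name NullTimer, its poll method, and the
injectable clock argument are assumptions introduced for illustration; the thesis
does not specify how the processor clock is read.

```python
import time

class NullTimer:
    """Tracks TIMEOUT_i for time-driven Null messages (illustrative names)."""

    def __init__(self, null_time, clock=time.monotonic):
        self.null_time = null_time          # NULLTIME_i, in clock units
        self.clock = clock                  # processor clock; injectable for tests
        self.timeout = clock() + null_time  # TIMEOUT_i := clock_i + NULLTIME_i

    def poll(self, out_channels, send_null):
        """Send one Null per output channel if the time-out has expired."""
        now = self.clock()
        if now > self.timeout:
            for j in out_channels:
                send_null(j)
            self.timeout = now + self.null_time
            return True
        return False

# Usage with a fake clock so that the behaviour is deterministic.
ticks = iter([0.0, 0.5, 1.5, 1.6])
timer = NullTimer(1.0, clock=lambda: next(ticks))
sent = []
results = [timer.poll([1, 2], sent.append) for _ in range(3)]
```

Calling poll once per pass through the Read Phase loop keeps the check cheap, and
passing the clock in as a parameter makes the time-out logic testable without waiting
on a real timer.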


Naturally, the value of NULLTIMEi at each LPi affects the efficiency
of this algorithm in a given situation. The performance effects of the choice of
NULLTIMEi are discussed in Chapter 4.

3.2.7  Bounds  on  Required  Memory  Each  LP  in  the  logical  system  requires 
a  bounded  amount  of  storage,  and  the  sum  of  the  storage  required  by  all  LP’s  is 
comparable  to  the  storage  required  by  an  equivalent  sequential  simulation.  This 
property  of  the  logical  system  is  one  shared  by  the  original  Null  Message  algorithm 
[CM79]. 

The set of state variables for LPi, STATEi, represents the states of all PP’s in
Si. In a logical system of N LP’s, ∪{STATEi, i = 1, ..., N} represents the state of
the logical system. The number of PP’s is bounded, and the amount of information
required to capture the state of any PP must be bounded. Otherwise, the physical
system could not be simulated [CM79]. The storage required to maintain the states
of all PP’s in the distributed system is the same as that required in an equivalent
sequential simulation.


A method of controlling message input into LPi as a way of bounding the
required size of the event list EVNTLISTi has been introduced above in Section
3.2.2.1. Given the properties of the logical system, it can be shown that the required
size of EVNTLISTi is bounded by a predictable amount. It is assumed that every
PPm ∈ Si, m = 1, ..., M, may have a pending internal event at any time. Thus the
number of events in EVNTLISTi is bounded by M plus the maximum number of
event messages that are pending in EVNTLISTi at any point during the simulation.

From Algorithm 3-C, we know that in any single Read Phase, messages are
read until LOOKi can be advanced from time Ti, and that messages read over any
channel [k, i] during any Read Phase have the event time component t = Ti, except
for the last message that is read over [k, i] in any Read Phase, which may have
time  component  t,  T,  >  t  >  LOOKi.  The  maximum  number  of  messages  that 
can be sent at an instant in simulation time from a given PP to another is one,
by monotonicity in the physical system. Because each event message represents a
message transmitted in the physical system, the number of messages that can be
read into LPi during a Read Phase is bounded by the sum of the number of PP’s
that can send messages to each PP within Si, plus one additional message for each
input channel [k, i]. This number is known from the configuration of the physical
system, without examining any PP internally.

It follows from Algorithm 3-C that any message read into LPi has a simulation
time component t < LOOKi, and thus will be simulated during the subsequent Event
Phase. Given this, the number of event messages in EVNTLISTi from outside of
LPi is bounded by the maximum number of event messages that can be read
during a single Read Phase, as described above.

Event messages can also be sent between two PP’s within LPi during an Event
Phase of processing. Because each event message has the same time component
as the internal event that prompted it, all event messages internal to LPi that are
prompted in a particular Event Phase are also simulated in that same phase, and
before any later event is simulated, due to the chronology of EVNTLISTi. The
maximum number of event messages internal to LPi pending at any point in time is
then the maximum number that can be prompted in a single instant in simulation
time. As in the case of external event messages, this is equal to the number of PP’s
in Si that send messages to each PPm ∈ Si, m = 1, ..., M, due to the monotonicity
of messages in the physical system.

Because the order in which internal events and message events at an instant in
simulated time at LPi are simulated cannot be predicted in advance, assume that all
event messages at a point in simulated time are prompted before any are simulated
(worst case for bounding EVNTLISTi). The maximum number of event messages
pending is the sum of the maximum number of event messages from within LPi
and those read in. Combining these two quantities with the maximum number of
pending internal events at LPi gives the upper bound on the number of messages in
EVNTLISTi as:

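\[ \sum_{m=1}^{M} |IN_m| + K + M \]

where IN_m is the set of PP’s that may send messages to PPm ∈ Si, and K is the
number of input channels [k, i] of LPi (the one additional message per input channel
noted above).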
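
As a small worked illustration of this bound, the following Python sketch evaluates
it for a made-up logical process. The function name event_list_bound, the dictionary
representation of the sets IN_m, and the example PP names are hypothetical; K is
taken here to be the number of input channels of the LP.

```python
def event_list_bound(in_sets, num_input_channels):
    """Upper bound on the number of entries pending in EVNTLIST_i.

    in_sets            -- dict mapping each PP m in S_i to the set IN_m of PP's
                          that may send messages to it
    num_input_channels -- K, assumed to be the number of input channels [k, i]
    """
    m_count = len(in_sets)                          # M: one internal event per PP
    message_bound = sum(len(senders) for senders in in_sets.values())
    return message_bound + num_input_channels + m_count

# Hypothetical LP holding two PP's: "q1" is fed by an external PP and by "q2",
# "q2" is fed by "q1"; there is one input channel from a neighbouring LP.
bound = event_list_bound({"q1": {"ext", "q2"}, "q2": {"q1"}}, 1)
```

Here the bound is 3 + 1 + 2 = 6 entries, obtained from the configuration alone
without examining any PP internally.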

It  is  submitted  that  the  size  of  each  event  is  bounded,  as  an  event  reflects  a 
single  change  in  value  over  a  number  of  state  variables,  and  the  number  of  state 
variables  of  each  PP  is  bounded,  for  reasons  described  previously. 

The storage overhead for the operation of each LPi, consisting of the variables
declared in the preceding algorithms, is also bounded. Each declared variable is
either a scalar quantity, such as LOOKi, or a one-dimensional array of scalars over
each input channel [k, i], such as NEXTki, or over each output channel [i, j], such
as LASTij. Given that the number of LP’s is finite, each of these arrays is bounded.
Hence, the storage overhead at each LP is bounded for a given system of LP’s.

Given a blocking send communications protocol as described in this chapter,
the amount of storage dedicated to message buffers at each LP may be arbitrarily
chosen without affecting the correctness of the logical system. The chosen buffer
size, however, will generally affect the execution time of the logical system.

IV. Performance Analysis


4.1 Performance Characteristics of the Event List

The event list is the primary data structure of the distributed event list algorithm.
The efficiency of the implementation of the event list at each logical process
will have a major effect on the execution time of the distributed event list simulation.
Similarly, the event list implementation in the sequential simulation used for
comparison will affect the computed speed-up. The distributed event list algorithm
will be shown to take advantage of the time complexity of event list operations to
provide a speed-up factor that, in some cases, exceeds what had been thought to be
the theoretical bound on achievable speed-up.

The event list is a priority queue of event notices, ordered by event simulation
time. The primary operations on the event list are next event retrieval and event
notice insertion, although others, such as previewing the first event and deletion of
an arbitrary event from the list, may be performed. Next event retrieval removes
and returns the first event notice in the event list. Event notice insertion operates
on a given event notice to insert it in its proper order in the event list.

4.1.1 The Linear List Implementation  Several physical implementations of
the event list have evolved. The “classical” implementation is an ordered linear
list, which was used in early simulation languages, such as GASP [Pri74], and is
still in fairly wide use today. Its major advantage is ease of implementation. The
event list used in the implementation of the distributed event list algorithm and
the comparison sequential simulations is a doubly-linked linear list implemented
with FORTRAN arrays. Next event retrieval with the linear list involves simply
removing the event notice at the head of the list, and is therefore done in constant
time. Insertion of an arbitrary event notice into the event list requires that the items
in the list be searched sequentially for the correct point of insertion. In asymptotic
notation, event notice insertion with a linear list is therefore accomplished in O(L)
time, where L is the number of events in the event list [VD75].
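
The two primary operations can be sketched in Python as follows. This is not the
doubly-linked FORTRAN array implementation used in the thesis: the class name
LinearEventList, the use of collections.deque, the scan from the head of the list,
and the comparison counter are assumptions chosen to make the O(L) insertion and
constant-time retrieval visible.

```python
from collections import deque

class LinearEventList:
    """Ordered linear event list; a sketch, not the FORTRAN array version."""

    def __init__(self):
        self._events = deque()    # (time, payload) pairs kept in time order
        self.comparisons = 0      # work done by the sequential searches

    def insert(self, time, payload=None):
        """O(L): scan from the head for the first event with a later time."""
        index = len(self._events)
        for pos, (t, _) in enumerate(self._events):
            self.comparisons += 1
            if t > time:
                index = pos
                break
        self._events.insert(index, (time, payload))

    def get_next(self):
        """O(1): remove and return the event notice at the head of the list."""
        return self._events.popleft()

    def __len__(self):
        return len(self._events)

events = LinearEventList()
for t in (5.0, 1.0, 3.0):
    events.insert(t)
first = events.get_next()
```

Because the search stops only at a strictly later time, events with equal time
stamps are retrieved in the order in which they were inserted.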

The dependence of event list insertion time on the number of event notices in
the list at the time of insertion is illustrated by the following timing experiment.
Using a linear list implementation of the event list, event lists of size 10, 100, and
1000 were created, in which the simulation time increments of events in the list were
exponentially distributed. In this experiment, 100 events (with the same distribution
of inter-event times) were inserted into each event list; an event was removed from
the front of the list prior to each insertion to maintain constant list size. The total
time to accomplish the insertions only was measured for each event list size. The
results of this experiment, shown in Table 4.1, demonstrate the correlation between
event list size and list insertion time.
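
A rough analogue of this experiment can be written in Python by counting search
comparisons rather than measuring processor time. The sketch below makes no attempt
to reproduce the measured values of Table 4.1; the function name
insertion_comparisons, the seed, and in particular the rule used to choose the time
of each inserted event (head time plus an exponential increment scaled by the list
size) are assumptions made only so that insertions land throughout the list.

```python
import random

def insertion_comparisons(list_size, inserts=100, seed=1):
    """Count sequential-search comparisons for `inserts` insertions."""
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(list_size):              # build the initial ordered list
        t += rng.expovariate(1.0)           # exponential inter-event increments
        times.append(t)
    comparisons = 0
    for _ in range(inserts):
        head = times.pop(0)                 # remove first event: size stays fixed
        new_time = head + rng.expovariate(1.0 / list_size)   # assumed placement rule
        index = len(times)
        for pos, existing in enumerate(times):
            comparisons += 1
            if existing > new_time:
                index = pos
                break
        times.insert(index, new_time)
    return comparisons

counts = {size: insertion_comparisons(size) for size in (10, 100, 1000)}
```

Under these assumptions the comparison count grows roughly in proportion to the
list size, which is the qualitative behaviour the timing experiment is meant to show.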

Table 4.1. Insertion into Linear List with Exponential Inter-Event Times

4.1.2 Theoretical Bounds on Speed-Up  This dependency of insertion time on
the number of event notices in the event list has important implications for the
attainable speed-up factor using the distributed event list algorithm. It has been
assumed in previous literature [BJ85, Hei86] that there is a bound on the speed-up
factor of any distributed simulation equal to N, where N is the number of processors
used. This is a theoretical bound, neglecting communications and synchronization
overhead, and has not been thought to be attainable [BJ85].

The underlying assumption behind a speed-up bound of N is that the execution
time T of a simulation depends upon the number of events, E, occurring in the
simulation model, and processing time T_E per event, bounded by some constant.
The execution time T of the sequential simulation is then E T_E. The theoretical
minimum execution time for a distributed execution of the simulation model over
processors P_n, n = 1, ..., N, is then equal to

\[ \max\{E_n,\ n = 1, \ldots, N\}\, T_E \]

where E_n is the number of events occurring in the simulation sub-model at processor
n. This value is minimized for a given number of processors when
E_1 = E_2 = ... = E_N = E/N. The absolute minimum execution time for a distributed
simulation over N processors occurs when events are evenly distributed, and
is (E/N) T_E. The absolute maximum speed-up can then be computed as

\[ \frac{E\, T_E}{(E/N)\, T_E} = N. \]
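
The argument above can be checked numerically with a few lines of Python. This is
only a sketch of the linear-time model being described; the function name
linear_model_speedup and the example event counts are invented for illustration.

```python
def linear_model_speedup(events_per_processor):
    """Speed-up when run time is proportional to event count (no overhead).

    The per-event time T_E cancels, so only the event counts E_n are needed.
    """
    total = sum(events_per_processor)       # E, the sequential event count
    return total / max(events_per_processor)

balanced = linear_model_speedup([250, 250, 250, 250])
skewed = linear_model_speedup([700, 100, 100, 100])
```

With four processors the balanced assignment reaches the bound of N = 4, while any
imbalance lowers the speed-up because the most heavily loaded processor determines
the execution time.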

The assumption of a linear relationship between the number of events in a
simulation and its execution time is easily shown to be unjustified by considering
the simulation overhead in a sequential simulation algorithm. Super-linear time
complexity of the sequential algorithm in terms of E will be demonstrated for a linear
list implementation of the event list; similar results can be shown for other event list
implementations, although the super-linearity of the more efficient implementations
is not as dramatic.

A  non-preemptive  sequential  simulation  algorithm  is  considered,  such  that 
scheduled  events  are  never  deleted  from  the  event  list  prior  to  their  removal  for 
execution.  Deletion  of  an  arbitrary  event  notice  from  the  event  list  is  of  the  same 
time  complexity  as  an  insertion,  so  that  omitting  the  possibility  of  the  former  does 
not  weaken  the  argument,  as  will  become  evident. 

In the general sequential event list algorithm, the basic iteration, performed
once for each of E events in the simulation, is as follows: 1) Remove the first event
from the event list. 2) Calculate the effects of the event, including updating the
state variables of the simulation, possibly calculating new events and inserting them
in proper order in the event list. The execution time of the sequential simulation
can be analyzed by considering the number of events, E, and the following periods:

    T_c(e) - time required to calculate a new event e
    T_r(e) - time required to remove event e from the event list
    T_u(e) - time required to update state variables due to event e
    T_i(e) - time required to insert event e into the event list


Due to the nature of the linear list, T_r(e) is constant for all e = 1, ..., E, and so
has O(1) time complexity. Assume T_u(e) and T_c(e) are also O(1) for all e, bounded
by constant values. (Because the goal of the argument is to show super-linearity of
the sequential algorithm with respect to E, this is the worst case for the argument.)


Now consider T_i(e), which with a linear event list is of O(L_e) time complexity,
where L_e is the number of events in the event list at the time of insertion of event
e. One upper bound on L_e is E, but it can be seen intuitively that it is a trivial one:
since each event must be inserted into the list, there is no event that is inserted
when there are already E events in the event list. Similarly, there exists only one
possible event e' such that L_{e'} = E - 1 at the time of insertion. Because the value
of L_e is bounded by L_{e-1} + 1, a meaningful bound on insertion time can only be
obtained by considering the set of event insertions in the simulation as a whole.


The total time to insert all of the events of the simulation into the event list is
bounded by a function of \sum_{e=1}^{E} L_e. This value, the sum of the lengths of
the event list at the time of each event’s insertion, is maximized when the maximum
number of events are scheduled before any are removed and simulated. In such a case,
L_1 = 0, L_2 = 1, and, in general, L_e = e - 1. The total insertion time for all
events, T_I, can then be expressed in asymptotic notation as follows:

\[
\begin{aligned}
T_I &= O\left(\sum_{e=1}^{E} (e - 1)\right) \\
    &= O\left(((E - 1)E)/2\right) \\
    &= O\left((E^2 - E)/2\right) \\
    &= O(E^2)
\end{aligned}
\]


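The total execution time of the sequential algorithm, T, is:

\[ T = \sum_{e=1}^{E} \left( T_c(e) + T_r(e) + T_u(e) \right) + T_I \]

which can be expressed in asymptotic notation as:

\[
\begin{aligned}
T &= \sum_{e=1}^{E} \left( O(1) + O(1) + O(1) \right) + T_I \\
  &= E \cdot O(1) + O(E^2) \\
  &= O(E) + O(E^2) \\
  &= O(E^2)
\end{aligned}
\]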

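With events evenly distributed, each of the N processors simulates E/N events, so the
distributed execution time is O((E/N)^2) = O(E^2/N^2). Theoretical speed-up of the
distributed event list algorithm, with sequential and distributed algorithms using a
linear list, and with the same idealistic assumptions made previously, is then:

\[ O(E^2) / O(E^2 / N^2) = N^2 \]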
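
The effect can be illustrated numerically. The following Python sketch compares
worst-case insertion costs only, under the same idealized assumptions (even division
of events, no overhead, every event scheduled before any is removed); the function
names and the example values E = 32000 and N = 32 are chosen purely for illustration.

```python
def worst_case_insert_cost(num_events):
    """Sum of the list lengths L_e = e - 1 for e = 1, ..., num_events."""
    return num_events * (num_events - 1) // 2

def ideal_speedup(num_events, num_processors):
    """Ratio of sequential to per-processor worst-case insertion cost,
    assuming events divide evenly and ignoring all overhead."""
    per_processor = worst_case_insert_cost(num_events // num_processors)
    return worst_case_insert_cost(num_events) / per_processor

speedup = ideal_speedup(32000, 32)
```

For these values the ratio is close to N^2 = 1024 rather than N = 32, since each
processor searches a list roughly 1/N as long and performs 1/N as many insertions.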

As before, this value is the theoretical best speed-up, and is not, in general,
attainable, due to overhead and communications costs, and the fact that it is based
upon the “worst case” time complexity of the sequential algorithm. Other event list
implementations of lower time complexity will have correspondingly lower theoretical
speed-ups - although real execution time will decrease in comparison to the linear
list implementation as the average L increases. Attained speed-up will decrease with
increased overhead for the distributed simulation, naturally, and also in those cases
where the linear components of the execution time, such as event calculation time,
overwhelm the relative advantage in event insertion time gained by distributing the
event list.


4.2 Empirical Studies

Empirical studies were conducted to gauge the performance of the variants of
the distributed event list algorithm under a variety of conditions. In these studies, a
family of queueing models with high degrees of inherent parallelism were simulated
using each variant of the distributed algorithm. The results demonstrate the
relationship between speed-up and certain characteristics of the simulation, and provide
insights into the most effective strategies for transmission of Null messages under
varying conditions, as well as yielding some elementary heuristics for mapping a
given simulation model into a logical system.

4.2.1 Methodology  Empirical performance studies were conducted on the Intel
iPSC-d5 hypercube systems at AFIT. Synthetic simulation models were constructed,
and executed under controlled conditions.

4.2.1.1 Simulation Workload Model  The models used as the simulation
workload in the empirical studies are queueing models consisting of 32 replications
of a homogeneous sub-model, connected in various topologies. The basic sub-model,
proposed by Prof. W. Shaw at AFIT, ensures a model with a high degree of inherent
parallelism, with the additional properties of independent (entity creation) events
at each sub-model and a balanced flow of messages between sub-models. The basic
sub-model consists of two single-server queue PP’s in tandem, with entity creation
and termination processes, connected in a feedback loop as shown in Figure 4.1.

A balanced flow of entities between sub-models is achieved by controlling arrival
and service rates, and routing probabilities. Exponential inter-arrival times with a
mean of 1000 time units are used for the creation process at each sub-model, except
where a sub-model is a “source” process with no entities arriving from any other
sub-model. In that case, an exponential inter-arrival time of mean 1111.1 is used.
Service times for all sub-models are biased exponential with a mean of 100 time
units and a bias of 1 time unit. The bias is introduced to guarantee a non-zero
service time, thus enabling lookahead predictions to be made. Figure 4.1 depicts the
workload sub-model in queueing symbology.
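
One possible reading of the biased exponential service time is sketched below in
Python. The text does not say whether the mean of 100 includes the bias; the sketch
assumes that it does, so that the exponential part has mean 99. The function name
biased_exponential, the seed, and the sample size are likewise illustrative
assumptions.

```python
import random

def biased_exponential(rng, mean=100.0, bias=1.0):
    """Service time of at least `bias` with overall mean `mean` (assumed reading)."""
    return bias + rng.expovariate(1.0 / (mean - bias))

rng = random.Random(42)
samples = [biased_exponential(rng) for _ in range(10000)]
lookahead = min(samples)    # no service time falls below the bias of 1 time unit
```

Because every service time is at least one time unit, a server that starts service
at simulation time t cannot emit the resulting departure before t + 1, which is the
minimum lookahead the bias is intended to guarantee.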

Figure 4.1. Balanced Flow Queueing Sub-model


Complex simulation models are created by connecting 32 of the basic sub-models
in a topology such that an entity exiting one sub-model either enters a
connected sub-model (in the form of an event message sent between LP’s) or is
terminated. A probability is associated with each possible transition of an entity
between connected sub-models. In the empirical studies, logical systems of fewer
than 32 LP’s were constructed by grouping the basic sub-models to form larger
sub-models in accordance with the assignment criteria in use. Note that it is possible
to assign portions of a basic sub-model to different LP’s. This was not done in the
experiments, however, because of the tightly coupled nature of the basic sub-model,
as well as the convenience of manipulating sub-models instead of individual PP’s.


4.2.1.2 Experimental Environment  All experiments were conducted on
the Air Force Institute of Technology’s two Intel iPSC-d5 Hypercube multiprocessor
systems. The iPSC hypercube consists of 32 homogeneous processor nodes, each node
based on an Intel 80286 CPU. The nodes communicate exclusively through
message-passing (no shared memory), and are connected to each other in a 5-dimensional
hypercube topology with 10 Mbit/s point-to-point serial communications channels.
Nodes are also connected via a global Ethernet channel to the Cube Manager, an Intel
System 286/310 microcomputer [Int86b].

Due to the lack of a blocking send communication protocol in the iPSC node
operating system [Int86a], the performance effects of limited buffer size were not
considered. The existing non-blocking send protocol was used, and can be viewed
as equivalent to a blocking send implementation in which buffer sizes are sufficiently
large to eliminate the need for any LP to block while sending (for those simulations
performed). The distributed algorithm was implemented as if a blocking send was in
use, with Null messages sent under conditions sufficient to avoid the acyclic deadlock
possibility raised by a blocking send protocol. Because unconstrained buffer size is
the most general case of a system of communicating processes and these empirical
studies do not claim that the achieved speed-ups are attainable in every case,
the performance effects of constraining the buffer size (and resulting implications
for assignment heuristics and Null message strategies) are left as topics for future
research.

The implementation of the distributed event list algorithm is a modification
of a package of subroutines for performing distributed discrete-event simulation
developed in Ryan-McFarland (TM) FORTRAN by Prof. B. Donlan at AFIT. This
implementation provides an environment for distributed simulation similar to early
FORTRAN-based simulation languages such as GASP [Pri74]. In addition to the
statistics collection and other simulation functions generally provided by such
languages, synchronization and message-passing for distributed simulation are provided
in a manner nearly transparent to the user. More information on this implementation
is available from: AFIT/ENG, Wright-Patterson AFB, Ohio 45433 (Attn: Dr.
T. Hartrum).

The exponential random variates used were derived from Uniform(0,1) pseudo-random
numbers using the simplified inverse transform method [BC84]. The generation
of uniform random numbers was accomplished using the portable FORTRAN
multiplicative congruential generator proposed by Schrage, with modulus 2^31 - 1 and
multiplier 16807. The seed values for random number generation were taken from
pp. 212-3 of Bratley, Fox, and Schrage, and are reported to provide non-overlapping
sequences of length 131,072 [BFS83]. Independent random number streams were
used for events at each sub-model, so that the simulation results for a particular
model did not vary over different distributed and sequential implementations.
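
A generator of this kind can be sketched in Python as follows. This is an
illustrative rendering, not the FORTRAN routine used in the experiments: the function
names next_seed and exponential_variate are assumptions, and the variate formula
x = -mean * ln(u) is one common form of the simplified inverse transform. The
constants 127773 and 2836 follow from the stated modulus and multiplier.

```python
import math

MODULUS = 2**31 - 1            # 2147483647
MULTIPLIER = 16807
Q = MODULUS // MULTIPLIER      # 127773
R = MODULUS % MULTIPLIER       # 2836

def next_seed(seed):
    """One step of x <- 16807 * x mod (2^31 - 1) using Schrage's decomposition,
    which keeps every intermediate value within 32-bit signed integer range."""
    high, low = divmod(seed, Q)
    value = MULTIPLIER * low - R * high
    return value if value > 0 else value + MODULUS

def exponential_variate(seed, mean):
    """Return (new_seed, variate) via the inverse transform x = -mean * ln(u)."""
    seed = next_seed(seed)
    u = seed / MODULUS         # u lies strictly between 0 and 1
    return seed, -mean * math.log(u)

seed = 1
for _ in range(10000):
    seed = next_seed(seed)
```

Starting from a seed of 1, the value after 10,000 steps should be 1043618065, the
standard check value for this multiplier and modulus. Python integers do not
overflow, so the decomposition is not needed for correctness here; it is shown
because it is what makes the generator portable to fixed-width integer arithmetic.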

Empirical testing of the Uniform(0,1) random number streams was accomplished
to verify that the values generated were indeed uniform and independent. A
run length of 5000 numbers was tested for each of the random number streams that
was used. To test the independence of the values within each stream, the up-down
runs test was performed with α = 0.05. To test for uniformity, the Chi-Square test
was performed with α = 0.05 and k = 100.
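
The uniformity test can be sketched in Python as shown below. The function name
chi_square_uniform is hypothetical, the evenly spaced test data are artificial, and
the critical value of about 123.23 is the approximate tabulated 0.95 quantile of the
chi-square distribution with k - 1 = 99 degrees of freedom; it should be checked
against a table before being relied upon.

```python
def chi_square_uniform(values, bins=100):
    """Chi-square statistic for values in [0, 1) against equal-probability bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    expected = len(values) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

CRITICAL_99_DF_05 = 123.23    # approximate tabulated value, 99 d.o.f., alpha = 0.05

# 5000 evenly spread values fill every one of the 100 bins with 50 observations.
even = [(i + 0.5) / 5000 for i in range(5000)]
statistic = chi_square_uniform(even)
passes = statistic <= CRITICAL_99_DF_05
```

A stream is accepted as uniform at the 0.05 level when its statistic does not exceed
the critical value; the artificial data above give a statistic of zero, whereas a
stream concentrated in a single bin would be rejected.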

The sequential simulations used to calculate speed-up factors were each executed
on a single processing node of one of the iPSC hypercubes, to ensure homogeneous
processor speed among observations. The algorithm for sequential simulation
used the same event-list implementation as the distributed event list algorithm, with
all overhead associated with message-passing and distributed simulation removed.

4.2.1.3 Performance Measurement and Instrumentation  All experimental
runs were accomplished on all or some subset of the 32 processor nodes. To initiate
a run, an identical process is loaded and started on each node of the hypercube,
and a configuration message is subsequently sent from the Cube Manager to each
node in the executing set of processes to inform it of its assigned sub-model of the
simulation.

When all participating processes have responded with a “configuration received”
message, a global “start simulation” message is sent. Processes not participating
in the simulation remain in a loop, waiting for a configuration message
to arrive. As each node completes its simulation, a “simulation complete” message
is sent to the Cube Manager. When completion messages are received from
all executing nodes, simulation statistics are collected from each node. During the
experimental runs, no other processes were executed on the processing nodes.

Included in the simulation execution times, in both distributed and sequential
algorithms, is the collection of certain simulation statistics. Population data was
gathered on inter-arrival and service times, and the time-in-system of all entities
in the queueing model. Data was collected on the length of each queue in the
simulation model, including the event list. Time data was also collected for server
utilization. Collection of data to calculate statistics such as these can be considered
an integral part of any simulation, and was therefore included in the execution time
of all simulation models, while the calculation of statistics from the data was not.
In addition, the numbers of Null and event messages received and sent at each LP
were accumulated in the distributed simulation runs. The additional overhead of
this collection, not reflected in the equivalent sequential simulation, was found to be
negligible, and was therefore ignored.

Empirical observations were made for simulation runs of 1,000,000 time units,
to  allow  the  underlying  simulation  to  reach  steady  state  conditions.  All  simulation 
runs  were  triple-replicated,  and  the  mean  taken  as  the  observed  value. 

4.2.2 Empirical Results

4.2.2.1 Effects of Topology and Spin Loop The topology of connections
among the LP’s of a distributed simulation has been identified as a significant
determinant of performance in performance studies of the Chandy-Misra algorithm
[PWM79, RMM88]. In response to this, the performance of the distributed event list
algorithm was investigated for simulation networks of several topologies. The most
basic of these is the tandem topology (See Figure 4.2), consisting of a set of $M$
sub-models $S_m$, $m = 1, \dots, M$, with probability 1.0 of entity routing from
$S_m$ to $S_{m+1}$, except for $S_M$, which routes outgoing entities into a
termination process. With no possibility of deadlock, the tandem model provides
little challenge to a distributed simulation algorithm, and satisfactory
performance for tandem networks can be considered a “feasibility test” for a
distributed simulation algorithm [RMM88].

Figure 4.2. Network of Sub-models in Tandem Topology

Tandem  topologies  are  expected  to  perform  well  in  distributed  simulation,  so 
it  is  not  surprising  that  significant  speed-ups  have  been  realized  by  applying  the 
distributed  event  list  algorithm  to  tandem  networks.  The  attained  speed-up  factors, 
however,  were  somewhat  higher  than  the  number  of  processors  in  most  cases.  This 
exceeds  what  had  previously  been  thought  to  be  the  theoretical  bound  on  speed-up 
of a distributed simulation. These super-linear speed-ups are evidence of the
super-linear effects of the event list implementation as described in Section 4.1.
Figure 4.3 presents the speed-up factors achieved for the tandem topology.
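
The super-linear effect can be illustrated with a small counting experiment. The sketch below is not the thesis implementation: it assumes a front-to-back scan of a linear list on every insertion, ignores event removal, and uses a worst-case ascending arrival order. The names entries_scanned, E, and N are hypothetical. It counts how many list entries are stepped past when E events go into one list, versus E/N events into each of N separate lists.

```python
def entries_scanned(timestamps):
    """Insert timestamps into a sorted linear list, scanning from the front.

    Returns the number of list entries stepped past over all insertions,
    used here as a stand-in for list insertion time (an assumption).
    """
    event_list = []
    scanned = 0
    for t in timestamps:
        pos = 0
        # Walk past every entry that should precede the new event.
        while pos < len(event_list) and event_list[pos] <= t:
            scanned += 1
            pos += 1
        event_list.insert(pos, t)
    return scanned


E, N = 1000, 4  # hypothetical event count and processor count

sequential = entries_scanned(range(E))          # one list holding all E events
per_processor = entries_scanned(range(E // N))  # each of N lists holds E/N events

# The N lists are scanned in parallel, so this ratio compares elapsed list work.
ratio = sequential / per_processor
print(sequential, per_processor, round(ratio, 2))
```

Under these assumptions the ratio comes out near N squared rather than N, which is the sense in which distributing a linear event list can yield speed-up greater than N.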

Figure 4.3. Speed-Up for Tandem Topology

A more complex family of topologies is the class of feed-forward topologies,
in which entities exiting a sub-model can be routed to one of several sub-models,
with the restriction that there exist no directed cycles in the sub-model
connection graph. Feed-forward networks often contain sets of sub-models in the
fork-join configuration that may cause acyclic deadlock to occur. Two versions of
a generalized feed-forward topology with a high degree of branching, as shown in
Figure 4.4, were investigated. A “balanced” version of the feed-forward topology
was evaluated, in which the entities emanating from a sub-model have an equal
probability of branching to a given connected sub-model. In addition, an
“unbalanced” version was investigated, in which an arbitrarily chosen path of each
multiple branch was assigned a routing probability of .01, with the remaining .99
probability divided evenly over the remaining paths.
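
The two routing schemes can be written down directly. The following is a minimal sketch, assuming a branch point with a known number of outgoing paths; the function name routing_probabilities and its parameters are hypothetical and are not taken from the thesis software.

```python
def routing_probabilities(num_paths, balanced=True, rare_path=0,
                          rare_probability=0.01):
    """Return one routing probability per outgoing path of a sub-model.

    balanced:   every path is equally likely.
    unbalanced: one chosen path gets rare_probability (.01 in the text);
                the remainder (.99) is divided evenly over the other paths.
    """
    if num_paths < 1:
        raise ValueError("a sub-model needs at least one outgoing path")
    if balanced or num_paths == 1:
        return [1.0 / num_paths] * num_paths
    share = (1.0 - rare_probability) / (num_paths - 1)
    probabilities = [share] * num_paths
    probabilities[rare_path] = rare_probability
    return probabilities


balanced_probs = routing_probabilities(4)
unbalanced_probs = routing_probabilities(4, balanced=False)
```

For a four-way branch, the balanced case gives .25 per path, while the unbalanced case gives .01 to the chosen path and .33 to each of the other three.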

In  performance  studies  of  the  Chandy-Misra  algorithm,  the  introduction  of 
feed-forward  branching  in  the  logical  system  has  been  shown  to  have  an  adverse 
impact  on  speed-up  [RMM88].  Those  results  are  contradicted  by  the  performance  of 
the  distributed  event  list  algorithm  in  feed-forward  networks  (See  Figure  4.5).  Both 
balanced  and  unbalanced  feed-forward  models  achieved  speed-ups  consistently  equal 
to  or  better  than  those  observed  in  the  tandem  model. 

The superior performance achieved for models with feed-forward topologies is
partially attributable to the nature of the simulation workload model in use. A
high ratio of internal events to outside communications in each sub-model makes the
simulation particularly amenable to a two-dimensional decomposition, and causes
performance to be relatively insensitive to the fairly large numbers of Null messages
transmitted in these feed-forward networks.

Figure 4.5. Speed-Up for Feed-Forward Topology

Another common topological class of simulation networks is that in which
directed cycles or “feedback loops” appear in the sub-model connection graph.
Feedback loops in a simulation topology test the cyclic deadlock avoidance mechanism
of the distributed simulator. Performance of the Chandy-Misra Null Message
algorithm has been shown to react negatively to the presence of feedback loops in the
LP connection graph [PWM79, CM81]. This property is also present in the distributed
event  list  algorithm  with  Null  messages.  Introducing  a  feedback  path  over  a  set  of 
LP’s  appears  to  cause  a  time-driven  effect  as  described  in  Chapter  2,  in  which  each 
LP  is  dependent  on  another  LP  for  time  advance,  the  net  result  being  that  all  LP’s 
are  constrained  by  the  slowest  LP  in  the  cycle. 
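
One way to see why a cycle is costly is to count the Null messages needed to move simulation time forward when no real events are pending. The sketch below is a deliberately simplified illustration and not the distributed event list algorithm itself. It assumes each LP in a ring promises its successor a time equal to its own clock plus a fixed lookahead; the function name and the lookahead values are hypothetical.

```python
def null_messages_until(target_time, lookaheads):
    """Count Null messages sent around a ring of LPs until every LP's
    clock has reached target_time, with no real events in the system.

    lookaheads[i] is the assumed minimum time increment LP i can promise.
    """
    n = len(lookaheads)
    clocks = [0] * n
    nulls = 0
    sender = 0
    while min(clocks) < target_time:
        receiver = (sender + 1) % n
        # The sender promises no message earlier than its clock plus lookahead.
        promise = clocks[sender] + lookaheads[sender]
        clocks[receiver] = max(clocks[receiver], promise)
        nulls += 1
        sender = receiver
    return nulls


small_lookahead = null_messages_until(10, [1, 1, 1, 1])
large_lookahead = null_messages_until(10, [10, 10, 10, 10])
```

In this toy ring, every Null message advances time by only one lookahead, so the message count grows with the ratio of the time to be covered to the lookahead. No LP can move ahead of what the others in the cycle have promised.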

This effect was demonstrated by adding a “pseudo” feedback loop from the
last LP to the first LP in a tandem logical system. The underlying simulation model
was not changed, however, so that there was no actual probability of entities
feeding back. The tandem model executed as it normally would, but the logical
system synchronized time advance as if a back-to-front feedback loop existed (see
Figure 4.6). Differences in distributed execution time from the tandem model are
then wholly attributable to the added synchronization and communication overhead
of the pseudo-feedback, and execution times can therefore be directly compared. The
tandem and pseudo-feedback models were executed on 32 processors (with identical
processor assignment). The introduction of a pseudo-feedback loop in the tandem
model resulted in a hundred-fold increase in execution time and an “explosion” of
Null message traffic. Additional pseudo-feedback loops nested inside of the outermost
loop resulted in further significant increases in execution time and the number of Null
messages transmitted. These results are shown in Table 4.2.

Figure 4.6. Tandem Logical System with Pseudo-Feedback

Topology                     Execution Time (s)    # of Null Messages
Tandem                       36.0                  6422
Pseudo-feedback (1 Loop)     3601.3                1.9 × 10^[illegible]
Pseudo-feedback (2 Loops)    3729.2                3.0 × 10^[illegible]
Pseudo-feedback (3 Loops)    3963.1                [illegible]
Pseudo-feedback (4 Loops)    4361.48               7.0 × 10^[illegible]

Table 4.2. Effect of Pseudo-Feedback Loops on Tandem Model


The effects of a computational workload applied to each event of the simulation
were evaluated. A computational workload, also known as the spin loop [Fuj88], was
associated with each event in the distributed and sequential simulations. The spin
loop is a totally artificial computational load; in this case each spin loop consisted
of two multiply operations and a divide operation. Experiments were conducted for
spin loops of 0 and 1000, for tandem and feed-forward topologies distributed over
$2^i$, $i = 1, \dots, 5$ processors.
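
A spin loop of this kind might look like the following sketch. The operands are assumptions chosen so that the value returns to its starting point on every pass; the text specifies only that each pass performs two multiplies and one divide.

```python
def spin_loop(iterations, x=1.5):
    """Artificial per-event workload: two multiplies and one divide per pass.

    The constants are hypothetical; 3.0 * 2.0 / 6.0 leaves x unchanged,
    so the loop does work without drifting or overflowing.
    """
    for _ in range(iterations):
        x = x * 3.0  # first multiply
        x = x * 2.0  # second multiply
        x = x / 6.0  # divide
    return x


no_load = spin_loop(0)       # spin loop of 0: no added work per event
heavy_load = spin_loop(1000) # spin loop of 1000: 3000 arithmetic operations
```

Calling spin_loop once per simulated event, in both the sequential and the distributed simulators, adds the same constant cost to every event.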

Adding  a  spin  loop  to  the  computation  of  each  event  had  several  effects.  In 
a  simulation  with  a  low  ratio  of  internal  events  to  event  messages  at  each  LP, 
the  increased  amount  of  work  done  in  relation  to  the  communications  overhead 
would  tend  to  increase  speed-up.  In  distributed  simulations  with  a  high  internal 
event/communications  ratio  as  displayed  by  the  models  in  these  empirical  studies, 
other  components  of  execution  time  are  dominant  over  communications  overhead.  In 
the distributed event list algorithm, an increased spin loop also causes the list insertion
time  to  be  a  relatively  less-important  component  of  execution  time.  The  constant 
load  per  event  can  overwhelm  the  super-linear  advantage  gained  by  distributing  the 
event  list. 
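
The dampening effect can be made concrete with a simple cost model. This is a sketch under stated assumptions, not the analysis of Section 4.1: it assumes a constant cost $c$ per event (the spin loop plus other fixed work), a linear event list whose total cost grows as $kE^2$ for $E$ events, a perfectly even split of events over $N$ processors, and no communication or synchronization overhead. The symbols $c$ and $k$ are introduced here for illustration only.

```latex
\begin{align}
T_{\text{seq}}  &= cE + kE^{2}, \\
T_{\text{dist}} &= c\,\frac{E}{N} + k\left(\frac{E}{N}\right)^{2}, \\
S = \frac{T_{\text{seq}}}{T_{\text{dist}}}
  &= N\,\frac{c + kE}{c + kE/N}.
\end{align}
```

With $c = 0$ the model gives $S = N^2$. As $c$ grows relative to $kE$, the fraction tends to 1 and $S$ falls toward $N$. Because the model omits communication overhead, it cannot by itself account for speed-ups below $N$.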

The  latter  effect  can  be  expected  to  be  dominant  in  the  models  presented 
above,  since  achieved  speed-ups  have  been  consistently  super-linear.  This  does  turn 
out  to  be  the  case  for  both  tandem  and  feed-forward  networks,  as  seen  in  Figure 
4.7  (only  balanced  feed-forward  is  shown;  unbalanced  feed-forward  results  were  very 
similar). 

4.2.2.2 Null Message Strategies The relative effectiveness of the variant
strategies for transmitting Null messages was evaluated in a series of experiments.
Two variants, the use of a Null Message Time-out and the addition of Stimulus Nulls,
both described in Chapter 3, were evaluated in comparison with the “standard”
strategy for Null message transmission. Comparisons were made for networks of
various topologies, for numbers of processors $2^i$, $i = 1, \dots, 5$, and with the
addition of spin loop. A summary of the experimental factors is presented in Table 4.3.
Note that no feedback topologies were evaluated. Preliminary analyses demonstrated
that both Null message variants caused the number of Null messages generated by
feedback models, already excessive, to further increase. This often led to buffer
saturation and the abnormal termination of the simulation, as a result of necessary
implementation compromises that had been made (discussed in Section 4.2.2). It
was decided to orient the research toward other, more productive areas.

Topology                     Spin Loop    Number of Nodes             Null Messages
tandem                       0            $2^i$, $i = 1, \dots, 5$    standard
feed-forward (balanced)      1000                                     w/ Time-out
feed-forward (unbalanced)                                             w/ Stimulus Nulls

Table 4.3. Summary of Factors for Null Message Strategy


One variant method for Null message transmission is the Null message with
Time-out algorithm described in Section 3.2.6.2. Recall that this algorithm variant
sends Null messages after a specified amount of real clock time has passed without
a simulation time advance. In the experiments, the standard Null message strategy
was compared with the Time-out algorithm for time-out values of 25 ms, 100 ms,
1000 ms, and 10^[illegible] ms.

In the case of the tandem model, the Time-out strategy was found to hurt
performance in almost every instance. This is somewhat intuitive, since a tandem
model with a balanced workload sends relatively few Null messages, so there is little
chance of excessive Null message overhead. (In addition, the high internal/external
event ratio of the workload model would appear to benefit from a certain amount of
Null message traffic.) The negative effect of Time-out was most pronounced for the
case of 0 spin loop, as can be seen in Figure 4.8. In these observations, increasing
the Time-out value always decreased the speed-up. The decrease in speed-up was
not proportional to the Time-out value, but roughly proportional to the number
of Null messages eliminated by the Time-out (the limit, of course, being when no
Null messages are sent). This effect is consistent with the relative insensitivity to
Time-out of the simulations distributed over few nodes, since those highly-utilized
processes send Null messages infrequently in the tandem model.

Because  of  the  decreased  ratio  of  the  Time-out  values  to  event  processing  times, 
Time-outs  had  less  effect  when  a  spin  loop  of  1000  was  introduced.  The  effect  of  the 
Time-out  was  still  negative  in  each  case,  however. 
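
The Time-out rule itself can be sketched as a small timer object. This is an illustrative sketch only: the class and method names are hypothetical, and restarting the timer after each Null send is an assumption rather than a detail confirmed by the text. The clock is injectable so that the rule can be exercised without waiting in real time.

```python
import time


class TimeoutNullSender:
    """Decide when an LP may send Null messages under the Time-out variant."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_mark = clock()

    def on_time_advance(self):
        # Simulation time moved forward: restart the real-time timer.
        self.last_mark = self.clock()

    def on_null_sent(self):
        # Assumed behaviour: wait a full time-out again before the next Null.
        self.last_mark = self.clock()

    def should_send_null(self):
        # Null messages are held back until the time-out has elapsed
        # with no simulation time advance.
        return self.clock() - self.last_mark >= self.timeout_s


fake_now = [0.0]  # hypothetical test clock, in seconds
sender = TimeoutNullSender(0.025, clock=lambda: fake_now[0])  # 25 ms time-out
before = sender.should_send_null()
fake_now[0] = 0.030
after = sender.should_send_null()
sender.on_time_advance()
after_advance = sender.should_send_null()
```

A larger time-out suppresses more Null messages, which matches the observation that the loss in speed-up tracked the number of Null messages eliminated rather than the time-out value itself.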

Feed-forward models were evaluated with the time-out algorithm, and also
suffered decreased performance in comparison to the standard method. As in the
tandem case, the decrease in speed-up was consistent with increased time-out, with
the effect proportional to the number of Null messages eliminated by each Time-out.
The negative effect of Time-out on the unbalanced feed-forward model was
noticeably greater than that in the balanced case, which was relatively unaffected (See
Figure 4.9).

Figure 4.8. Tandem Speed-Up, Time-out Nulls, 0 Spin Loop

Figure 4.9. Balanced Feed-Forward Speed-Up, Time-out Nulls, 0 Spin Loop

The poor performance of the Time-out algorithm is also partly attributable
to the high internal event/communications ratio of the workload model. More positive
results would be expected in applying Time-out to simulations that are
“communications bound,” especially those in networks with high degrees of branching.

As the sending of extra Stimulus Null messages (as described in Section 3.2.6.1)
is, in some sense, the “complement” of the Time-out, the Stimulus Null method
would appear to have potential, given the failure of the Time-out algorithm that has
been observed. Experiments were performed in which Stimulus Nulls were transmitted
over the outgoing message channels of each LP in conjunction with 10%, 1%,
and 0.1% of the events simulated at that LP.
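
The Stimulus Null rates can be illustrated as follows. The text gives only the percentages; the sketch assumes a deterministic rule in which Stimulus Nulls accompany every k-th event, with k the reciprocal of the rate. The function name is hypothetical.

```python
def stimulus_null_events(num_events, fraction):
    """Return the event numbers (1-based) after which Stimulus Nulls
    would be sent on every outgoing channel of an LP.

    Assumes Nulls accompany every k-th event, where k = 1 / fraction.
    """
    interval = round(1 / fraction)
    return [e for e in range(1, num_events + 1) if e % interval == 0]


at_10_percent = stimulus_null_events(1000, 0.10)
at_1_percent = stimulus_null_events(1000, 0.01)
at_tenth_percent = stimulus_null_events(1000, 0.001)
```

Over 1000 events at one LP, the three rates correspond to 100, 10, and 1 rounds of Stimulus Nulls respectively, each round sending one Null per outgoing channel.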

Results  of  these  experiments  show  that  slight  increases  in  speed-up  are  regularly 
achievable  with  Stimulus  Nulls,  across  all  topologies  and  spin  loop  values.  This  was 
typified  by  the  performance  of  the  balanced  feed-forward  case  with  1000  spin  loop, 
shown in Figure 4.10.

An  interesting  observation  is  that  of  feed-forward  networks  with  0  spin  loop. 
Here both balanced and unbalanced models showed a decrease in speed-up at 16
to 32 nodes as 10% Stimulus Nulls were applied. For these models, lower levels
of Stimulus Nulls had negligible effect, as seen in Figure 4.11. Because the
feed-forward models with 0 spin loop have the highest Null message traffic to begin with,
the  appearance  is  that  of  a  “threshold”  for  Null  messages,  up  to  which  performance 
improves,  but  after  which  performance  begins  to  drop  off  due  to  the  overhead  of 
reading  and  sending  extraneous  Null  messages. 

The assignments of simulation models to logical processes in the above
experiments were chosen so as to maintain a balanced workload among logical processes.
Identical assignments were used across all observations. The logical process
assignment did not induce feedback loops in any instance. The assignment of logical
processes to processors of the hypercube was done to minimize communications
overhead in the tandem case and variants thereof. The processor assignment was fixed
for all runs of each topology.

Figure 4.10. Balanced Feed-Forward Speed-Up, Stimulus Nulls, 1000 Spin Loop

Figure 4.11. Unbalanced Feed-Forward Speed-Up, Stimulus Nulls, 0 Spin Loop

4.2.2.3 The Assignment Problem The assignment of processes to processors
is a classic problem in distributed processing. Two conflicting factors,
communications overhead and processing workload balance, must be traded off in order
to achieve an assignment that maximizes throughput. Methods for calculating an
optimal assignment for a distributed application are generally computationally
intractable, however [CH*80]. This may be especially applicable in the area of
simulation, where the communications and processing loads of a simulation model are
not well-known a priori (or else closed form solutions, rather than simulation, could
have been used to solve the application problem). It is then necessary to resort
to heuristic methods for determining a “good,” rather than “best” assignment of
processes to processors.

In the distributed event list algorithm, the way in which the logical process
encapsulates the simulation sub-model at its processor adds a new dimension to the
assignment problem. The monolithic nature of the LP time advance mechanism
makes it possible for non-cyclic topologies of PP’s to be mapped into cyclic networks
of LP’s, as shown in Chapter 3. Even simple tandem networks can be assigned in
such a way. Given the poor performance demonstrated by the distributed event list
algorithm in association with feedback loops in the logical system, the topology of
the logical system must be considered in addition to the “traditional” assignment
factors of workload and communication overhead.

Assignment       Characteristics
Balanced Load    Vertical format; No feedback introduced; Equal number of PP’s per LP
Pure Vertical    No feedback introduced; Load balance not considered;
                 Logical system reduces to tandem

Table 4.4. Assignment Strategies for Feed-Forward Topology

Assignment       Execution Time (s)    # Null Messages    # Event Messages
Balanced Load    103.2                 4288               2394
Pure Vertical    160.2                 4055               3049

Table 4.5. Effect of Assignment, Balanced Feed-Forward, 8 Processors
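
Whether a proposed assignment turns a non-cyclic PP topology into a cyclic LP network can be checked mechanically. The following is a minimal sketch, assuming the PP topology is given as a list of directed edges and the assignment as a mapping from PP to LP; the function names and the example data are hypothetical.

```python
def lp_graph(pp_edges, assignment):
    """Collapse PP-to-PP edges into LP-to-LP edges under an assignment."""
    edges = set()
    for src, dst in pp_edges:
        a, b = assignment[src], assignment[dst]
        if a != b:  # traffic inside one LP creates no inter-LP channel
            edges.add((a, b))
    return edges


def has_cycle(edges):
    """Depth-first search for a directed cycle in the LP graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {node: 0 for node in graph}  # 0 unvisited, 1 in progress, 2 done

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)


# A four-stage tandem network of PP's assigned to two LP's, A and B.
tandem = [(1, 2), (2, 3), (3, 4)]
contiguous = {1: "A", 2: "A", 3: "B", 4: "B"}   # LP graph: A -> B
interleaved = {1: "A", 2: "B", 3: "A", 4: "B"}  # LP graph: A -> B and B -> A

contiguous_cyclic = has_cycle(lp_graph(tandem, contiguous))
interleaved_cyclic = has_cycle(lp_graph(tandem, interleaved))
```

The interleaved assignment balances the load just as well as the contiguous one, yet it induces a feedback loop between the two LP's even though the tandem model has none.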

Experiments  were  conducted  to  provide  insights  into  effective  strategies  for 
mapping  a  simulation  onto  a  given  number  of  logical  processes.  In  one  set  of  these 
experiments,  models  of  feed-forward  topology,  with  both  balanced  and  unbalanced 
routing, were assigned to a logical system of 8 processors, with the assignment
methods shown in Table 4.4.


The results of these assignment experiments show that achieving an even
workload over all logical processes is of greater importance than eliminating branching,
for both balanced and unbalanced feed-forward networks. These results mesh with
earlier findings of comparable performance for tandem and feed-forward topologies.

Assignment       Execution Time (s)    # Null Messages    # Event Messages
Balanced Load    103.5                 4456               2307
Pure Vertical    159.6                 4124               3023

Table 4.6. Effect of Assignment, Unbalanced Feed-Forward, 8 Processors


Figure 4.12. Tandem Topology with Single Loop

An experiment was conducted to gauge the effectiveness of containing an
existing feedback loop within a logical process at the expense of load balancing. A new
topology was introduced for this experiment, consisting of a tandem network with
a single feedback loop with .01 routing probability, as shown in Figure 4.12. The
configuration of this topology yields the worst possible load balance for an 8 node
assignment if the feedback loop is contained within a logical process. The assignment
strategies used are given in Table 4.7.


Assignment        Characteristics
Balanced Load     Equal number of PP’s per LP
Loop Contained    Contain feedback loop within a single LP; Load balance not considered

Table 4.7. Assignment Strategies for Tandem Topology w/ Single Loop


Assignment       Execution Time (s)    # Null Messages    # Event Messages
Balanced Load    2997.44               [illegible]        685
Contain Loop     650.1                 1190               655

Table 4.8. Effect of Assignment, Tandem with Single Feedback, 8 Processors

It is apparent from the Single Loop model that the elimination of feedback
loops can play a more important role than load balance in the performance of the
distributed event list algorithm. As shown in Table 4.8, containing the feedback loop
at the cost of an unbalanced computational workload reduced execution time
from 2997.44 s to 650.1 s, an improvement by a factor of roughly 4.6.


The preceding assignment experiments suggest the essential relationships
between the factors of topology and workload balance, and so may be used as starting
points for heuristic solutions to the assignment problem. Finding effective heuristics
for the assignment problem is a major hurdle to be overcome before the distributed
event list algorithm can be considered widely applicable.


4.3 Summary

A linear event list implementation for distributed simulation has been shown
to be $O(E^2)$, for E events simulated. Time complexity of event list operations of
greater than $O(E)$ has been shown to imply theoretical speed-up factors of greater
than N for a simulation distributed over N processors. This result contradicts a
commonly held view in the literature, which asserts the existence of a bound of N on
attainable speed-up. It has been shown that a speed-up bound of N is unjustified, as
it ignores the time complexity of the event list overhead in the sequential simulations
used for comparison.

Empirical  studies  have  been  conducted  to  evaluate  the  performance  of  the 
distributed  event  list  algorithm  under  a  variety  of  conditions.  Speed-up  greater 
than  N  was  achieved  for  certain  topologies  of  simulation  models,  confirming  the 
above  time  complexity  analysis.  The  topology  of  the  simulation  model  was  shown 
to greatly affect the attained speed-up. Simulation networks with directed cycles or
“feedback loops” were shown to exhibit extremely poor performance, in agreement
with previous performance studies of the Chandy-Misra algorithm [RMM88].
Feed-forward branching topologies, which showed poor performance relative to the tandem
model  in  previous  studies  (on  shared-memory  machines),  actually  performed  better 
than  the  tandem  case  in  these  experiments.  This  was  attributed  to  the  high  inherent 
parallelism  and  low  communications  ratio  in  the  models  studied. 

A  high  computational  workload  associated  with  each  event  was  shown  to  lower 
the  attained  speed-up  below  N.  This  contradicted  Fujimoto’s  results,  in  that  the 
higher computation/communications ratio induced by the increased workload is
expected to improve speed-up [Fuj88]. This conflict was explained as a “dampening” of
the  super-linear  effects  of  distributing  the  event  list  with  a  constant  time  component 
associated  with  each  event. 




Alternate  strategies  for  sending  the  Null  messages  used  for  deadlock  avoidance 
were compared. Results showed that for tandem and feed-forward topologies, a
certain level of Null messages was beneficial to speed-up. It was also seen that a
threshold  exists,  above  which  additional  Null  messages  are  unnecessary  overhead. 
No  strategy  that  was  evaluated  was  shown  to  improve  the  poor  performance  of 
simulation  topologies  with  feedback. 

The problem of assigning a given simulation model to a set of logical processes
was addressed. Again, it was seen that topology played a critical role in the
effectiveness of an assignment strategy. Avoiding feedback loops was shown to be of
greater importance than “traditional” assignment considerations such as balancing
the processing load of the logical system.

V. Conclusions


5.1  Summary 

An algorithm for distributed discrete-event simulation, the distributed event
list algorithm, has been described in this thesis. This algorithm uses an event list to
order events at each logical process, and uses a variant of the Chandy-Misra algorithm
for  inter-process  communication,  with  a  prediction  function  added.  Null  messages 
are  used  to  avoid  process  deadlock,  as  in  the  original  Chandy-Misra  algorithm.  The 
distributed  event  list  algorithm  has  been  shown  to  require  a  bounded  amount  of 
memory  at  each  logical  process. 

A  study  of  event  list  implementations  shows  that  the  theoretical  speed-up 
factor  for  the  distributed  event  list  algorithm  is  in  excess  of  N,  for  a  simulation 
distributed  over  N  processors,  where  both  sequential  and  distributed  simulation  event 
lists  are  implemented  with  a  linear  list.  More  efficient  event  list  implementations 
yield  lower  theoretical  speed-ups,  but  can  still  exceed  N,  which  had  previously  been 
thought  to  be  the  optimum  speed-up. 

Empirical studies show that speed-up values greater than N can be regularly
achieved for simulations with high degrees of parallelism and topologies without
feedback, if the computational workload per event is small in comparison to the
list insertion time. Topology is shown to be a prime factor in determining attainable
speed-up of a given simulation, with the presence of feedback loops playing a critical
role.


5.1.1  Null  Message  Strategies  Several  Null  message  strategies  were  evaluated 
in addition to the basic deadlock avoidance strategy. The strategy of sending extra
“stimulus” Nulls was found to be marginally more effective than the basic strategy
in tandem networks, and in cases where there was a large computational load or
“spin loop” associated with each event.

The strategy of sending Null messages during the LP Read Phase with a time-out
between sends did not prove to be effective. Indeed, this strategy, an attempt
to reduce the number of “unnecessary” Null messages, was found to provide worse
performance  than  the  basic  algorithm  in  every  case.  In  networks  with  feedback 
loops,  the  time-out  strategy  had  the  effect  opposite  to  that  intended,  causing  an 
“avalanche”  of  Null  messages  that  virtually  paralyzed  the  logical  system,  leading  to 
buffer  saturation  in  most  cases. 

The  most  significant  result  of  studying  the  variant  Null  message  strategies  was 
that  no  strategy  was  found  to  improve  the  poor  performance  of  the  distributed  event 
list  algorithm  in  the  presence  of  feedback  loops  in  the  logical  system. 

5.1.2 Assignment Heuristics Empirical studies have yielded some elementary
heuristics for the assignment of physical to logical processes when using the
distributed event list algorithm. The following basic rules can be used for distributing
a simulation over a given number of logical processes. The rules are listed in
descending order of importance:

1.  Avoid  introducing  feedback  loops  due  to  assignment  of  the  simulation  model 
to  the  logical  system. 

2.  Contain existing feedback loops within a logical process whenever possible.

3.  Achieve  a  balanced  processing  load  among  the  logical  processes. 

4.  If  possible,  reduce  the  amount  of  branching  in  the  logical  system. 
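
Rule 2 amounts to grouping physical processes that lie on a common directed cycle. The sketch below is one way to find those groups, assuming the PP connection graph is given as a node list and a directed edge list. It uses a simple mutual-reachability test rather than an optimized strongly-connected-components algorithm. The function names and the six-node example are hypothetical and are not the topology of Figure 4.12.

```python
def reachable(graph, start):
    """Return every node reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def loop_containing_groups(nodes, edges):
    """Group PP's so that every feedback loop falls inside a single group.

    Two PP's share a group when each can reach the other, which means
    they lie on a common directed cycle.
    """
    graph = {n: set() for n in nodes}
    for a, b in edges:
        graph[a].add(b)
    reach = {n: reachable(graph, n) for n in nodes}
    groups, placed = [], set()
    for n in nodes:
        if n in placed:
            continue
        group = [m for m in nodes if m in reach[n] and n in reach[m]]
        groups.append(group)
        placed.update(group)
    return groups


# Hypothetical six-stage tandem network with feedback from PP 5 to PP 2.
pps = [1, 2, 3, 4, 5, 6]
links = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 2)]
groups = loop_containing_groups(pps, links)
```

Each resulting group is a candidate for a single logical process. The groups themselves form a non-cyclic graph, so rule 1 is respected, but the loop-bearing group may carry most of the load, which is the trade-off against rule 3.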

5.2  Assessment  of  the  Distributed  Event  List  Algorithm 

The distributed event list algorithm has been shown to provide a significant
source of speed-up for discrete-event simulation in many circumstances. There are,
however, several properties of the algorithm which may limit its applicability.
Perceived advantages and disadvantages of the distributed event list algorithm in
relation to other distributed simulation algorithms are enumerated below.

5.2.1  Advantages  of  the  Distributed  Event  List  Algorithm  Some  advantages 
of  the  distributed  event  list  algorithm  for  distributed  discrete  event  simulation  are 
as  follows: 

•  The distributed event list algorithm can provide heretofore unrealized
speed-ups for simulations of certain topologies. The algorithm can be highly efficient,
providing significant speed-up with few processors.

•  The algorithm is based on the widely-used event-oriented view of simulation,
and therefore requires no change in perspective to use, as do some other
distributed simulation algorithms. Parallelization of existing sequential simulation
models should then be comparatively easy with the distributed event list
algorithm.

•  The  use  of  a  single  logical  process  at  each  processor,  as  in  the  distributed  event 
list  algorithm,  facilitates  checkpointing  and  the  collection  of  statistics. 

5.2.2  Disadvantages  of  the  Distributed  Event  List  Algorithm  Some  perceived 
weaknesses in the distributed event list algorithm are the following:

•  Attainable  speed-up  is  highly  dependent  upon  the  absence  of  feedback  loops 
in  the  logical  system  topology,  as  in  the  original  Chandy-Misra  Null  message 
algorithm. A feedback loop in the simulated system can be negated by
containing it inside a single logical process. This practice, however, limits the
number  of  processes  in  the  logical  system  to  the  maximum  number  of  disjoint 
non-cyclic  subgraphs  of  the  physical  process  connectivity  graph. 

•  The distributed event list algorithm cannot provide a deterministic execution
of simultaneous events in many cases. Because each logical process can only
predict its future message output if no event is scheduled at the current simulation
time, processes are required, in order to avoid cyclic deadlock, to simulate
any events scheduled for the current simulation time, rather than waiting for
all possible message events at that time to arrive. Event messages with the
same simulation time arriving at a logical process from two different processes
are simulated in the arbitrary order of their arrival.

•  The  tightly-coupled  nature  of  the  logical  process  complicates  the  assignment 
problem.  As  demonstrated  in  Chapter  3,  a  logical  system  may  have  a  radically 
different  topology  than  the  underlying  physical  system,  perhaps  introducing 
feedback  loops  where  none  exist  in  the  simulated  system.  The  aforementioned 
poor  performance  of  cyclic  systems  and  the  necessity  of  maintaining  logical 
system predictability (Chapter 3) make topology an important assignment
consideration.
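
The containment of feedback loops described in the first item above can be illustrated with a short sketch. The Python fragment below is not part of the distributed event list algorithm; it assumes the physical system is given as a directed graph and uses Kosaraju's strongly connected components algorithm (a technique chosen here for illustration) to group every feedback loop into one logical process. The function names and the four-process example network are hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: map each node to a component id."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    def visit(u):
        seen.add(u)
        for v in fwd[u]:
            if v not in seen:
                visit(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            visit(u)

    # Pass 2: sweep the reversed graph in reverse completion order.
    component = {}
    def assign(u, cid):
        component[u] = cid
        for v in rev[u]:
            if v not in component:
                assign(v, cid)
    next_id = 0
    for u in reversed(order):
        if u not in component:
            assign(u, next_id)
            next_id += 1
    return component

def contain_feedback_loops(nodes, edges):
    """Place each feedback loop inside one logical process (assumed policy)."""
    component = strongly_connected_components(nodes, edges)
    logical_edges = {(component[u], component[v])
                     for u, v in edges if component[u] != component[v]}
    return component, logical_edges

# Hypothetical physical system: q2 feeds back into q1.
nodes = ["source", "q1", "q2", "sink"]
edges = [("source", "q1"), ("q1", "q2"), ("q2", "q1"), ("q2", "sink")]
component, logical_edges = contain_feedback_loops(nodes, edges)
num_logical = len(set(component.values()))
```

In this example the four physical processes yield only three logical processes, because q1 and q2 must share one. This is the reduction in available parallelism that the first item describes.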

5.3 Recommendations for Further Research

Recommendations for additional research focus on ameliorating some of the
perceived weaknesses of the distributed event list algorithm. These weaknesses keep the
distributed  event  list  algorithm  from  possessing  the  qualities  of  robustness  required 
for  a  generally-applicable  distributed  simulation  algorithm. 

A variant of the distributed event list algorithm was developed that replaces
the Null message algorithm for deadlock avoidance with a Marker algorithm for
deadlock detection and recovery, as proposed by Misra [Mis86]. This variant should
be evaluated as a possible remedy for the poor cyclic performance of the present
algorithm. Improving the performance of the distributed event list algorithm in
networks with feedback would ease the assignment problem as well. Other possible
algorithms for deadlock detection and recovery are outlined in [CM81].

Comprehensive  assignment  heuristics  should  be  developed,  with  the  eventual 
goal  of  integration  into  an  automated  system  for  physical-to-logical  system  mapping. 
The elementary heuristics provided above, such as avoiding inducing loops in the
logical  system,  would  be  relatively  straightforward  to  automate.  More  challenging 
would  be  the  automation  of  in-depth  assignment  heuristics,  utilizing  information 
internal  to  the  physical  system  to  estimate  a  near-optimal  assignment  based  on  the 
estimated  computation  and  communications  load  for  a  given  simulation. 
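
As an indication of how the elementary loop-avoidance heuristic might be automated, the following Python sketch derives the logical system induced by a candidate assignment and tests it for directed cycles. It is a minimal illustration under stated assumptions, not a proposed implementation: the function names, the dictionary representation of an assignment, and the three-stage tandem example are all hypothetical.

```python
def induced_logical_edges(physical_edges, assignment):
    """Edges between logical processes implied by an assignment
    (assignment maps each physical process to a logical process id)."""
    return {(assignment[u], assignment[v])
            for u, v in physical_edges
            if assignment[u] != assignment[v]}

def has_cycle(edges):
    """Depth-first search for a directed cycle."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {n: WHITE for n in graph}

    def dfs(u):
        colour[u] = GREY
        for v in graph[u]:
            if colour[v] == GREY:          # back edge closes a cycle
                return True
            if colour[v] == WHITE and dfs(v):
                return True
        colour[u] = BLACK
        return False

    return any(colour[n] == WHITE and dfs(n) for n in graph)

# Hypothetical physical system: a three-stage tandem with no feedback.
tandem = [("A", "B"), ("B", "C")]

# Neighbouring stages share a logical process: the logical system stays acyclic.
good = {"A": 0, "B": 0, "C": 1}
# First and last stages share a logical process: a loop is induced.
bad = {"A": 0, "B": 1, "C": 0}

good_is_cyclic = has_cycle(induced_logical_edges(tandem, good))
bad_is_cyclic = has_cycle(induced_logical_edges(tandem, bad))
```

The second assignment shows a feedback loop appearing in the logical system although none exists in the physical system. An automated mapping tool could reject such candidates before estimating computation or communications load.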

A grave weakness in the distributed event list algorithm is its inability to
execute simultaneous events deterministically. There is no obvious solution to
this problem. Unless the algorithm can be modified so that the order of execution
of simultaneous events can be ascertained, the distributed event list algorithm
will remain unsuitable for those applications, such as digital logic simulation,
in which the order of execution of simultaneous events significantly affects the
resulting system state.
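
The sensitivity of digital logic simulation to the ordering of simultaneous events can be seen in a small example. The Python sketch below is purely illustrative and is not taken from the experiments in this thesis: it assumes a hypothetical edge-triggered flip-flop modelled as a logical process that receives a data message and a clock message carrying the same simulation time from two different senders, and processes them in arrival order.

```python
def simulate_flip_flop(arrivals):
    """Process (time, kind, value) messages strictly in arrival order.

    Hypothetical model: a 'data' message changes the D input, and a
    'clock' message latches the current D input into the output Q.
    """
    d, q = 0, 0
    for _time, kind, value in arrivals:
        if kind == "data":
            d = value
        elif kind == "clock":
            q = d
    return q

# Two messages with the same simulation time from different senders.
data_msg = (10, "data", 1)
clock_msg = (10, "clock", None)

q_data_first = simulate_flip_flop([data_msg, clock_msg])
q_clock_first = simulate_flip_flop([clock_msg, data_msg])
```

With the data message processed first the output is 1, and with the clock message processed first it is 0. Since arrival order is arbitrary, two runs of the same model may end in different states.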